THESIS DISCLAIMER

The computer programs developed in this research have not been exercised for 
all cases of interest. Every reasonable effort has been made to eliminate computa- 
tional and logical errors, but the programs should not be considered fully verified. 
Any application of these programs without additional verification is at the user’s 
risk. A reasonable effort has been put forth to make the code efficient. Optimization 
has been suppressed, however, in areas where it would jeopardize the simplicity and 
clarity of the algorithm without great reward in terms of performance. 

I. PREFACE



The need for speed accompanied by reliability has driven many advances in machine 
design. The history of computing is replete with examples — many from scientific 
fields — where necessity became the impetus for faster, more reliable machinery. 
Without exception, history and past designs have played key roles in the invention 
of new equipment. The maturity of mechanical calculator design was foundational
in the construction of electronic computers. Today’s multiprocessor computers are 
extensions of uniprocessor machines and include technology developed by our tele- 
phone industry. Many well-worn tools and lessons from the past can be applied. 
Many new ideas must be put to the test. This thesis is about applying old principles 
and evaluating new tools and equipment. 

A. A SURVEY OF COMPUTING MACHINERY 

Nothing is more important than to see the sources of invention, which are,
in my opinion, more interesting than the inventions themselves. 

— GOTTFRIED WILHELM LEIBNIZ (1646-1716) 

1. Beginnings 

The history of mathematics and computing is as old as civilization. Tools 
like the abacus have been used to simplify arithmetic problems. Wilhelm Schickhard 
(1592-1635), Blaise Pascal (1623-1662), and Gottfried Wilhelm Leibniz designed and 
built mechanical, gear-driven calculators. The latest of these was essentially a four- 
function calculator. By the mid-1800s, Charles Babbage had designed his Difference 
Engine and proceeded to the more advanced Analytical Engine. These machines were 
never completed (at least not to the grand scale that Babbage planned), but the basic
design of the Analytical Engine lies at the heart of any modern computer. Consider 
his motivation. 

The following example was frequently cited by Charles Babbage (1792-1871)
to justify the construction of his first computing machine, the Difference Engine
[Ref. 1]. In 1794 a project was begun by the French government under the direction
of Baron Gaspard de Prony (1755-1839) to compute entirely by hand an enormous
set of mathematical tables. Among the tables constructed were the logarithms of
the natural numbers from 1 to 200,000 calculated to 19 decimal places. Comparable
tables were constructed for the natural sines and tangents, their logarithms, and the
logarithms of the ratios of the sines and tangents to their arcs. The entire project
took about 2 years to complete and employed from 70 to 100 people. The mathematical
abilities of most of the people involved were limited to addition and subtraction.
A small group of skilled mathematicians provided them with their instructions. To
minimize errors, each number was calculated twice by two independent human
calculators and the results were compared. The final set of tables occupied 17 large
folio volumes (which were never published, however). The table of logarithms of the
natural numbers alone was estimated to contain about 8 million digits.

This quote, from Hayes [Ref. 2: p. 1], helps to explain why computers
exist and shows some of the incentive for making them better. Computing machinery
is designed for speed and reliability. A computer’s “performance” should
be measured against both of these components. Speed normally receives the most
attention. Reliability, by whatever label you choose to give it, rarely receives due
(or timely) attention. Too often, errors and issues of correctness receive careful
consideration only in reactive, not proactive, situations. Kahan says, “The Fast drives
out the Slow even if the Fast is wrong” [Ref. 3: p. 596].

The correctness side of performance is a much tougher game; and reliability 
can be a fairly subjective matter. Often we pursue solutions that are “good enough” 
(and this cannot always be defined). Time, on the other hand, has well-defined units 
and the standards for measuring time enjoy a history as old as the first sunrise. The 
ease with which the programmer can access the machine’s clock makes measurements 
of this side of performance somewhat easier. 

Figure 1.1: Technologies and Computing Speed



Industry demands fast machines because “time is money” and speed alone 
can make difficult, time-consuming problems tolerable. Without doubt, the speed 
of a processor and execution time are important performance considerations. But 
speed is partly dependent upon technology. Babbage’s designs represented quite an 
advance, but they could not be realized in his day. Technology can determine which 
designs succeed, and to what extent. Figure 1.1 compares several recent technologies 
using speed (measured in operations per second) as the yardstick. The data for this 
illustration was taken from Hayes [Ref. 2: p. 9]. As the figure indicates, it was nearly
a century after Babbage’s work when major technological advances came about. 

2. Electricity 

Significant gains in speed were made possible when electricity could be used 
in computer engineering. The United States census of 1890 employed punched cards 
that were read using electricity and light. Herman Hollerith (1860-1929), the de- 
signer of these cards, formed a company that would later join others and (in 1924) 
take on the name International Business Machines Corporation. Punched paper tape 
was later used by IBM in the Harvard Mark I, a general-purpose electromechani- 
cal computer designed by Howard Aiken (1900-1973). In the late 1930s, at Iowa 
State University, John V. Atanasoff was creating a special-purpose machine to solve 
systems of linear equations. He is credited with “the first attempt to construct an 
electronic computer using vacuum tubes” [Ref. 2: p. 16]. 

In 1943, J. Presper Eckert and John W. Mauchly began work at the University
of Pennsylvania to direct the creation of “the first widely known general-purpose
electronic computer”. The Electronic Numerical Integrator and Calculator
(ENIAC) project was funded by the U. S. Army Ordnance Department. The 30-ton
machine was completed in 1946. It held more than 18,000 vacuum tubes. It could 
perform a ten-digit multiplication in three milliseconds, three orders of magnitude 
faster than the Harvard Mark I. [Ref. 2: pp. 17-18] 

3. First Generation Computers 

From Babbage’s Analytical Engine to ENIAC, computer architectures held 
data and programs in separate memories. In 1945, John von Neumann (1903-1957) 
proposed the stored-program concept (i.e., programs and data could be stored in 
the same memory unit). The Hungarian-born mathematician’s involvement in the 
ENIAC project is not remembered by many, but the “von Neumann architecture”
has become commonplace. In fact, it “has become synonymous with any computer
of conventional design independent of its date of introduction” [Ref. 2: p. 31].
Hennessy and Patterson [Ref. 3: pp. 23-24] object to the widespread use of this
term, claiming that Eckert and Mauchly deserved more of the credit.

In 1946, von Neumann (and others) began to design such an architecture 
at the Institute for Advanced Studies (IAS), Princeton. This machine, now called 
the IAS computer, is representative of so-called first-generation computers (as Hayes 
points out: “a somewhat short-sighted view of computer history”). The IAS machine 
was roughly ten times faster than ENIAC [Ref. 3: p. 24]. During the 1946-1948
timeframe, A. W. Burks, H. H. Goldstine, and John von Neumann wrote a series of 
reports describing the IAS design and programming. The advances and refinements 
in computer design that came out of this period were important and lasting. By 
1950, von Neumann and his colleagues had formed a foundation of theory and design 
worthy of advanced technology. [Ref. 2: pp. 19-20] 

4. Transistors 

The change from vacuum tube to transistor technology marked the begin- 
ning of the “second-generation” of computers (approximately 1955-1964). Transis- 
tor technology provided faster switching elements, but this was not the only change 
of the decade. Many of the plans of the late forties and early fifties involved memory, 
so it was fitting that ferrite cores and magnetic drums be used for faster main mem- 
ories. Changes such as these led Hennessy and Patterson to conclude that “cheaper 
computers” were the principal new product of the early 1960s [Ref. 3: p. 26]. 

Additionally, machines began to become more sophisticated. The space and 
tasks of the central processing unit (CPU) and main memories were decentralized 
with the advent of special-purpose processors to augment the CPU and special-purpose
memories (e.g., registers) to augment the main memory. Finally, system
software was becoming a greater issue. Programming continued moving upward, 
away from the machine level, and the processing of batch jobs was becoming more 
automated. [Ref. 2: pp. 31-32] 

5. Integrated Circuits 

The first integrated circuit (IC) was introduced in 1961 [Ref. 4: p. 1], and
the use of ICs would be among the most significant advances evident in third- 
generation computers (starting about 1965). Integrated circuits brought major 
changes in cost, maintenance, reliability, and the amount of real estate required. 
Other than these hardware improvements (circuits and memory), third-generation 
computing was not easy to distinguish from that of the second generation. There was 
some migration from hardware to software (e.g., microprogramming), more special- 
ized and compartmentalized CPUs (e.g., pipelining), and system software continued 
to advance (e.g., operating systems that could support multiprogramming through 
“time-slicing”). [Ref. 2: p. 40]

6. Instruction Set Trade-Offs 

A large part of designing computer hardware and software involves analysis 
of cost-performance ratios. Other than genuine advances in design or technology, 
almost every aspect of computer architecture involves trade-offs. There is usually 
a spectrum of options from which the computer architect chooses, and the “best” 
solutions are not always found near the ends of the spectrum. Performance can rarely 
be optimized with respect to both space and time, so a balance must be sought. This
space-time conflict and others appear when a designer must select a sophisticated 
instruction set, or a very simple one, or one of the many options along the spectrum 
between these options. 

In the late 1970s and early 1980s both hardware and software became progressively
more sophisticated. Instructions became longer and more complex. The
Complex Instruction Set Computer (CISC) was popular. This design has the advan- 
tage of powerful instructions, but the machine must decode each instruction (it is 
a binary code). The decoding process favors brevity because longer instructions re- 
quire more levels of decoding circuitry. Nonetheless, if the longer instructions could 
carry enough meaning, the decoding endeavor would be justified. 

IBM researchers uncovered a provocative statistic — 20% of the instruction 
set was carrying 80% of the burden [Ref. 5: p. 5]. The instruction set had become 
too complex. With some help from several researchers and IBM, the Reduced In- 
struction Set Computer (RISC) architecture became popular. RISC machines admit 
a smaller vocabulary, but claim quicker comprehension. In fact, the goal of the RISC 
architectures is one-cycle execution of the instructions [Ref. 5: pp. 6-7]. Hennessy 
and Patterson, both key contributors to the RISC movement, give an indication of 
the current broad acceptance of the RISC architecture [Ref. 3: p. 190]: 

Prior to the RISC architecture movement, the major trend had been highly 
microcoded architectures aimed at reducing the semantic gap. DEC, with the VAX, 
and Intel, with the iAPX 432, were among the leaders in this approach. In 1989,
DEC and Intel both announced RISC products — the DECstation 3100 (based on the 
MIPS Computer Systems R2000) and the Intel i860, a new RISC microprocessor. 
With these announcements, RISC technology has achieved very broad acceptance.
In 1990 it is hard to find a computer company without a RISC product either
shipping or in active development.

Three major research projects were central to early RISC developments. The first,
the IBM 801, began in the late 1970s under the direction of John Cocke. In 1980,
David Patterson and his colleagues at the University of California at Berkeley began
the RISC-I and RISC-II projects for which the architecture is named. Finally, John
Hennessy and others at Stanford University “published a description of the MIPS
machine” in 1981. [Ref. 3: p. 189]

7. Multiprocessors and Multicomputers



The most recent advances in the design of computing machinery include 
parallel and concurrent architectures. The terminology associated with these ma- 
chines has been developing for about twenty-five years, but it is still immature. 
The terms “multiprocessor” and “multicomputer”, for instance, are sometimes used 
with additional meaning. C. Gordon Bell proposes that an MIMD machine with 
message passing and no shared memory be called a multicomputer. He calls a 
shared-memory MIMD machine a multiprocessor [Ref. 6: p. 1092]. This termi- 
nology seems to be on the way to acceptance, and it seems useful in giving a general 
characterization to many systems, but it lacks the sort of precision that may be 
necessary. 

First, the word “computer” usually carries many expectations with it. From 
a computer, we expect things like input and output facilities, peripheral devices, and 
so on. These are things that a node on a typical “multicomputer” does not always 
possess. A “processor” is just the opposite. It might be just about any sort of 
processor and we are cautious about attaching any expectations to the term. Many 
processors are special-purpose machines, but (more substantial) central processing 
units and arithmetic logic units are also numbered among processors. The terms 
“computer” and “processor” are not precise. 

Secondly, by automatically associating Flynn’s taxonomy, memory mod- 
els (e.g., shared, distributed), and other things with a terminology, we reduce their 
importance and hide them behind the term. By using the term “multicomputer”, 
without careful definition up front, we run the risk of forgetting that we are talking 
about an MIMD machine that uses message passing and has no shared memory. Ad- 
ditionally, this terminology — packed with expectations — ignores an entire spectrum 
of very real possibilities. Are we saying that a machine cannot employ a combination 
of shared and distributed memory? Using this terminology, how would we say that
the memory available to each node of a given system was 30 percent shared and 70 
percent local (distributed)? 

Nevertheless, the terms have some use, provided we don’t expect too much 
of them. After all, we distinguish cars from trucks in everyday conversation with 
reasonably little confusion. But — in the same way that it is not prudent to assume 
that “car” implies a vehicle equipped with a V-8 engine and four doors — we should 
be careful to guard against packing too many specifics and expectations into the 
terms “multiprocessor” and “multicomputer.” For this reason, the terms multipro- 
cessor and multicomputer are used almost interchangeably in this work. A conscious 
effort is made to support them with a clear description of the memory paradigm, 
communications facilities, and so on. 

Bell's terminology identifies the systems used in this work (iPSC/2 and 
transputer networks) as multicomputers. Nevertheless, I often use the term “mul- 
tiprocessor” to identify a system with more than one processor (such as the ones 
described in Chapter V and Appendix B). That is, multiprocessor means nothing 
more than the expected combination of “multi” with “processor.” To forestall confu- 
sion, the rest of the thesis pertains to distributed memory machines that use message 
passing to communicate instructions and data between nodes. 

8. Uniprocessors and Multiprocessors 

At the chip level, multiprocessor systems resemble their single-processor 
predecessors. Experience (e.g., telephone industry, electronic technology) and a foun- 
dation of theory and design (e.g., von Neumann’s work, network theory) are distinct 
benefits in the development of equipment and techniques for distributed and parallel 
computing. From a system perspective, though, the concurrent use of more than one 
processor creates a fundamentally different environment. 

Uniprocessor systems differ substantially from multiprocessors and multicomputers
in their ability to access data without competition. In the presence of
more than one processor — regardless of memory model — there is a need to coordinate 
requests for data. This means that the multicomputer must accommodate interpro- 
cessor communications. The nodes of a multiprocessor system must work together 
efficiently to justify the cost of the resulting system. Some parts of the solution are 
relatively mature, but a vast territory — algorithms, electronic components, media 
for communication, and software engineering techniques — begs further exploration. 

B. CURRENT APPROACHES 
1. Machines 

To compare the capabilities of different machines, some method of bench- 
marking is typically used. By timing the execution of a certain program(s) on a given 
machine we can determine its performance for the given problem. By comparing the 
execution times for the same problem(s) on different machines, we arrive at a notion 
of their relative power. A popular method for sizing up the computing power of 
a machine is the LINPACK benchmarking program [Ref. 7]. This is essentially a 
program involving the solution of a dense system of linear equations. 
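
The timing procedure described above can be illustrated with a small sketch. This is not the LINPACK code itself, and it is not a program from this work. It is a minimal Python illustration that assumes the conventional operation count of 2n^3/3 + 2n^2 floating-point operations for solving a dense n-by-n system. The function names `solve_dense` and `benchmark` are hypothetical.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k up.
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def benchmark(n, seed=0):
    """Time one solve of a random n-by-n system; return (seconds, flops per second)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve_dense(a, b)
    # Guard against a zero reading from a coarse clock on very small problems.
    elapsed = max(time.perf_counter() - start, 1e-9)
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # assumed conventional count
    return elapsed, flops / elapsed


if __name__ == "__main__":
    for size in (50, 100, 200):
        seconds, rate = benchmark(size)
        print(f"n={size}: {seconds:.4f} s, {rate / 1e6:.2f} megaflops")
```

Running the same script on two machines and comparing the reported rates gives the kind of relative measure discussed above. An interpreted pure-Python solver will report rates far below what compiled code achieves on the same hardware, so the numbers are useful only for comparing like with like.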

Currently, under this LINPACK test, the fastest machines in the world 
have surpassed the gigaflop mark (a billion floating-point operations per second). 
Table 1.1, adapted from Dongarra’s report [Ref. 8: p. 21], shows performance data. 
The leftmost column of this table gives the name of the system and the cycle time (in 
parentheses). The next column contains p, the number of processors used to obtain 
the data that is shown in the four remaining columns. For most systems (e.g., the 
Intel iPSC/860) the size of the system (number of processors used for a given run) 
can be scaled, so data was reported for several different system sizes. 






TABLE 1.1: WORLD’S FASTEST COMPUTERS
(r_max and r_peak in gigaflops)

Computer (Clock Rate)               p      r_max   n_max   n_1/2   r_peak

Intel Delta (40 MHz)                512    11.9    25000   7000    20
Thinking Machines CM-200 (10 MHz)   2048   9.0     28672   11264   20
Intel Delta (40 MHz)                256    5.9     18000   5000    10
Thinking Machines CM-2 (7 MHz)      2048   5.2     26624   11000   14
Intel Delta (40 MHz)                192    4.0     12000   4000    7.7
Intel Delta (40 MHz)                128    3.0     12500   3500    5
Intel iPSC/860 (40 MHz)             128    1.9     8600    3000    5
nCUBE 2 (20 MHz)                    1024   1.9     21376   3193    2.4
Intel Delta (40 MHz)                64     1.5     8000    3000    2.6
nCUBE 2 (20 MHz)                    512    .958    15200   2240    1.2
Intel iPSC/860 (40 MHz)             64     .928    5750    2500    2.6
Fujitsu AP1000                      512    2.251   25600   2500    2.8
Intel iPSC/860 (40 MHz)             32     .486    4000    1500    1.3
nCUBE 2 (20 MHz)                    256    .482    10784   1504    .64
MasPar MP-1 (80 ns)                 16384  .44     5504    1180    .58
Fujitsu AP1000                      256    1.162   18000   1600    1.4
Intel iPSC/860 (40 MHz)             16     .258    3000    1000    .64
nCUBE 2 (20 MHz)                    128    .242    7776    1050    .32
Fujitsu AP1000                      128    .566    12800   1100    .71
Intel iPSC/860 (40 MHz)             8      .132    2000    600     .32
nCUBE 2 (20 MHz)                    64     .121    5472    701     .15
Fujitsu AP1000                      64     .291    10000   648     .36
Intel iPSC/860 (40 MHz)             4      .061    1000    400     .16
nCUBE 2 (20 MHz)                    32     .0611   3888    486     .075
Intel iPSC/860 (40 MHz)             2      .044    1000    400     .08
nCUBE 2 (20 MHz)                    16     .0320   5580    342     .038
Intel iPSC/860 (40 MHz)             1      .024    750             .04
nCUBE 2 (20 MHz)                    8      .0161   3960    241     .019
nCUBE 2 (20 MHz)                    4      .0080   2760    143     .0094
nCUBE 2 (20 MHz)                    8      .0040   1280    94      .0047
nCUBE 2 (20 MHz)                    8      .0020   1280    51      .0024

The column labeled $r_{\max}$ gives the performance (in gigaflops) for the largest
problem run on the machine. The size of that largest problem is indicated by $n_{\max}$,
where $n$ is the dimension of the matrix of coefficients, $A \in \mathbb{R}^{n \times n}$. The $n_{1/2}$ column
gives the problem size that yielded a rate of execution that was half of $r_{\max}$. Finally,
$r_{\text{peak}}$ denotes the theoretical peak performance (in gigaflops) for the machine.
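
One way to read these columns together is to compare the achieved rate with the theoretical peak. The short Python sketch below does this for four rows copied from Table 1.1. The ratio it calls "efficiency" and the per-processor rate are derived quantities introduced here for illustration only; they are not columns of the table, and the function names are hypothetical.

```python
# (system, processors p, r_max in gigaflops, r_peak in gigaflops), from Table 1.1
ROWS = [
    ("Intel Delta", 512, 11.9, 20.0),
    ("Thinking Machines CM-200", 2048, 9.0, 20.0),
    ("Intel iPSC/860", 128, 1.9, 5.0),
    ("nCUBE 2", 1024, 1.9, 2.4),
]


def efficiency(r_max, r_peak):
    """Fraction of the theoretical peak achieved on the largest problem."""
    return r_max / r_peak


def per_processor_mflops(r_max, p):
    """Achieved rate per processor, converted from gigaflops to megaflops."""
    return 1000.0 * r_max / p


def summarize(rows):
    """Return (system, p, efficiency, megaflops per processor) for each row."""
    return [
        (name, p, efficiency(r_max, r_peak), per_processor_mflops(r_max, p))
        for name, p, r_max, r_peak in rows
    ]


if __name__ == "__main__":
    for name, p, eff, mflops in summarize(ROWS):
        print(f"{name} (p={p}): {eff:.0%} of peak, {mflops:.2f} megaflops per processor")
```

Under these definitions the 512-processor Delta reaches roughly 60 percent of its peak and the 1,024-processor nCUBE 2 roughly 79 percent, which suggests that the ranking by raw gigaflops and the ranking by fraction of peak need not agree.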

This data indicates that Intel is the current leader, among companies in
the United States, of the teraflop race, so we shall take a closer look at their products.
The Intel i860 microprocessor, together with 8 megabytes of memory, forms
one of 128 nodes in the hypercube-connected iPSC/860. This machine achieves
performances of nearly two gigaflops with LINPACK. iPSC stands for Intel Personal
Supercomputer, so this entry would not appear to target high-end markets. The
most significant project in supercomputing at Intel today is the Touchstone project.

George E. Brown, chairman of the U. S. House Committee on Science, 
Space, and Technology, cut the ribbon around the Intel Touchstone Delta at the 
California Institute of Technology on May 31, 1991 [Ref. 9: p. 96]. The Delta
is a mesh of 528 nodes. Each node holds an i860 processor and 16 megabytes of
memory. This machine has reached the 11.9 gigaflop mark with the LINPACK
benchmark. The closest competitor in the world would appear to be the CM-200 
from Thinking Machines, Inc. This 2,048-node machine benchmarks at 9 gigaflops 
[Ref. 8: p. 21]. The Touchstone program is not over. Intel plans to follow the Delta 
with the Touchstone Sigma. Sigma will have at least 2,048 nodes, each consisting of 
the i860 XP processor (about twice as powerful as the i860). [Ref. 9: p. 96] 

The European high-performance computing market favors the transputer,
a microprocessor made by INMOS. The New York Times of May 31, 1991 lists one
German company, Parsytec, and seven American companies that have entered the
teraflop race: Bolt, Beranek, and Newman (BBN); Cray Research; IBM; Intel; nCUBE;
Thinking Machines; and Tera Computer [Ref. 10]. Parsytec expects their GC
to provide “the necessary 2 to 3 orders of magnitude increase in performance above
existing supercomputers to give scientists the tool to attack their Grand Challenges.”
[Ref. 10: p. 1]

Parsytec envisions a system of up to 16,384 processing elements based upon 
the INMOS T9000 transputer (see Chapter VII). This would give the Parsytec ma- 
chine 25-megaflop nodes capable of communications bandwidths near 100 megabytes 
per second. The Parsytec design begins with a cluster of seventeen T9000 processors 
(sixteen primary processors and the seventeenth for backup) and four C104 worm- 
hole routing chips. From four clusters, the company will craft a GigaCube (or simply 
Cube) of 64 processors (not counting redundant elements in the design). The GC- 
1 would represent a one gigaflop system and this would be the building block for 
greater systems (lesser systems can initially be equipped with 16, 32, or 48 nodes). 
The processors in a single (Giga)Cube are arranged in a three-dimensional (4x4x4) 
grid. [Ref. 10] 

2. Programming Practice 

Software engineering for multiprocessor systems is similar to contemporary 
practices for sequential machines. The programming languages used in this work 
provide normal C libraries with additional functions to accommodate interprocessor 
communications. The systems typically provide a loader designed to load executable 
code onto the (host and) nodes according to the programmer’s instructions. Some 
loaders require that the same code be loaded onto each of the nodes. Other, more 
flexible, loaders allow the user to specify which program should be loaded onto each 
node. The Logical Systems C network loader, LD-NET, is such a program. It takes
a Network Information File (NIF), describing the network’s interconnections and 
loading instructions, as input and performs the loading process. 

C. THE FUTURE



1. Crossroads 



Parallel and distributed computing is in the early years of a very promising 
lifetime. We should give careful consideration to the direction that the field should 
assume. Lacking years of experience, I will lean on the writings and advice of others 
while trying to peer a little ways into the future of parallel computing. A regrettable 
side effect of this decision is that this section seems to consist primarily of the 
observations and opinions of others. Notwithstanding the many quotations, I believe 
that several important ideas are exposed. 

This business is filled with a combination of old, established ideas and 
proven techniques. It also holds new questions and opportunities. Hamming’s advice
[Ref. 11: p. 14] seems most fitting in this situation:

Now I see constantly attempts to force new ideas to old molds. That is fre- 
quently sensible: How can I make sense of what I’m seeing compared to what I did 
before? But also one must ask, “Am I seeing something fundamentally new?” That
part many people will not try. You cannot afford to make everything brand new and 
not connect anything together with existing ideas, nor can you try to make every- 
thing fit into preconceived categories. Some combination of the two is necessary. 

We limped through the transistor revolution and the computer revolution, 
which are connected with the bandwidth revolution; they are all connected together . . . 

You have to abandon old ideas when you get an order of magnitude of change. . . . 

— RICHARD W. HAMMING 

Developments in scientific computing today make Dr. Hamming’s thoughts 
especially timely. The field needs to establish a strategy; a direction that will lead 
from its present immaturity to a place of fulfilling its potential. Kenneth Wilson 
proposes Grand Challenges for computational science that may help to establish this 
strategy [Ref. 12]. 

2. Grand Challenges



Wilson identifies three modes of scientific activity: theoretical, experi- 
mental, and computational. He defines these areas, claiming that — with today’s 
supercomputers — the most recent science (computational) is becoming more signifi- 
cant. So significant, in fact, that “long experience or professional training is required 
to be successful in computational science at the supercomputer level, making it ap- 
propriate to think of computational science as both a separate mode of scientific 
endeavor and new discipline.” [Ref. 12: p. 172] 

Wilson is careful to distinguish computational science from computer sci- 
ence. He defines computer science as the business of addressing “generic intellectual 
challenges of the computer itself” and characterizes computational science as being 
tailored to specific applications areas (with serious training in the application disci- 
pline) [Ref. 12: p. 172]. To advance computational science, Wilson recommends a 
quantitative approach with clear strategies [Ref. 12: p. 173]: 

The major future opportunities for benefits of supercomputers to basic re- 
search should be identified without the existing compromises, but presented as chal- 
lenges to be overcome with the many obstacles to success clearly explained. The 
compromises and inadequacies of current computations need to be described and 
the level of advances required to overcome these inadequacies discussed. Further- 
more, a few key areas with both extreme difficulties and extraordinary rewards for 
success should be labelled as the “Grand Challenges of Computational Science”.
Two examples are electronic structure and turbulence. No easy promises of success 
in Grand Challenges should be offered. Instead, computational scientists should be 
building plans to assault the Grand Challenges, pushing for the major advances 
in algorithms, software, and technology that will be required for true progress to 
be achieved in these areas. The Grand Challenges should define opportunities to 
open up vast new domains of scientific research, domains that are inaccessible to
traditional experimental or theoretical modes of investigation. 



Wilson describes a few examples that demonstrate the limitations of exper- 
imental instrumentation and the potential of supercomputers. Weather prediction, 
astronomy, materials science, molecular biology, aerodynamics, and quantum field 
theory are the six areas that Wilson chooses to make his point. He describes these
areas in reasonable detail and briefly mentions other topics. [Ref. 12: pp. 175-179] 

a. Mathematical Background 

Wilson stresses the need for sound design practices and good algorithms. 
(To see why, consider Table A.1.) Additionally, he warns that we should spend less
time in awe of today’s supercomputing power and admit that it is terribly inadequate. 
Modeling methods and sound mathematical background also appear in the “needs 
improvement” category. Wilson [Ref. 12: p. 180] believes that 

Mathematical developments that relate to numerical computation are highly 
important. Theorems about numerical errors or sources of error, exact solutions 
and expansions, existence and uniqueness proofs and the like, can make a major dif- 
ference in establishing the credibility of a numerical computation. All too frequently 
there is too little mathematical understanding backing up numerical simulation. 

b. Issues of Quality 

Wilson does not consider these to be the only problems facing com- 
putational scientists. He believes that quality is endangered, primarily from two 
directions [Ref. 12: pp. 180-181]: 

• A tendency to stay on the safe, easy side; not wandering far from the position: 
“our calculation agrees with experiment.” 

• The quality of computational programs, measured against practical criteria, 
is lacking. The standards include rounding errors (e.g., catastrophic cancella- 
tion), overflows, and stability (with respect to input parameters). 
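
The rounding-error criterion in the second item can be made concrete with a small example of catastrophic cancellation. The sketch below is not drawn from Wilson's paper. It is a standard textbook illustration, assuming IEEE double-precision arithmetic, that evaluates 1 - cos(x) for a small x in two algebraically equivalent ways. The function names are hypothetical.

```python
import math


def one_minus_cos_naive(x):
    """Direct evaluation: subtracts two nearly equal numbers when x is small."""
    return 1.0 - math.cos(x)


def one_minus_cos_stable(x):
    """Equivalent form 2*sin(x/2)**2, which avoids the subtraction entirely."""
    half = math.sin(x / 2.0)
    return 2.0 * half * half


if __name__ == "__main__":
    x = 1e-9
    # cos(x) differs from 1 by about 5e-19, which is below double-precision
    # resolution near 1, so the naive form loses every significant digit.
    print("naive :", one_minus_cos_naive(x))
    print("stable:", one_minus_cos_stable(x))
```

For x = 1e-9 the true value is close to 5e-19. The naive form returns zero, while the rewritten form keeps essentially full precision. Both functions are correct on paper, so a comparison against experiment at ordinary values of x would not reveal the defect.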

c. Languages



Wilson cites a number of reasons for revolutions in computer languages. 
In particular, he believes that “Fortran is in the long-term the most fundamental 
barrier to progress” [Ref. 12: p. 182]. His approach is realistic enough to recognize 
the vast investments of scientific communities in Fortran. The language cannot and 
should not be eliminated in a day. Nevertheless, it has very serious shortcomings. 
Some problems could be overcome by a Fortran preprocessor (the same idea as the C 
preprocessor). Other problems, like lack of support for abstraction and the unnatural 
exclusion of basic mathematical symbols from the language, are not solved as easily.
[Ref. 12: p. 182] 

Wilson does not recommend a simple change of language as the solution, 
but searches for deeper problems. He believes that the entire way that computational 
scientists and programmers think about and plan programs must change as well. 
After reading Wilson’s analysis of language problems, the basic impression that 
prevails is that we have an urgent need for general-purpose practices to replace 
patchwork, hit-or-miss, case-by-case solutions. 

3. Generality 

David Harel is also an advocate of the need for general purpose techniques. 
In the preface to his book [Ref. 13: p. viii] he warns: 

Curiously, there appears to be very little written material devoted to the sci- 
ence of computing and aimed at the technically oriented general reader as well as 
the professional. This fact is doubly curious in view of the abundance of precisely 
this kind of literature in most other scientific areas, such as physics, biology, chemistry
and mathematics, not to mention humanities and the arts. There appears to
be an acute need for a technically detailed, expository account of the fundamen- 
tals of computer science; one that suffers as little as possible from the bit/byte or 
semicolon syndromes and their derivatives, one that transcends the technological 
and linguistic whirlpool of specifics, and one that is useful both to a sophisticated 
layperson and to a computer expert. It seems that we have all been too busy with
the revolution to be bothered with satisfying such a need. 

This idea is not unique. One of the other major proponents of general-purpose
parallel computing is David May of INMOS. In an invited lecture at the
Transputing ’91 conference [Ref. 14], he highlighted features that general-purpose 
parallel hardware should deliver. Among the important components of a general 
approach, May included the following: 

• Scaling. Performance must scale with number of processors. Efficiency is 
partly dependent on problem size, but — with adequate problem size — systems 
of a thousand processors should be within technological reach. Each processor 
is expected to achieve $10^8$ to $10^9$ flops.

• Portability. This is almost synonymous with “general purpose.” May empha- 
sizes algorithms based upon features common to many machines, and which 
remain valid as technology evolves. He stresses that this general purpose par- 
allel architecture will benefit both the computer designer and the programmer. 
The designer will gain since the market will be somewhat predictable. The 
programmer’s code will work on several machines and hold a strong hope for 
working into future years. 

To achieve these goals, May proposes several guidelines. First, for a message passing 
system using p processors, the nodes must be capable of concurrent computing and 
communication. The interconnection topology must provide scalable throughput 
(linear in p) and bounded delay, probably log(p). Programs, May believes, should be 
written at as high a level as possible and make use of many processes. The algorithm 
should express the maximum possible parallelism. Much of May’s theory is based 
upon the structure of a hypercube interconnection topology (or virtual hypercube). 
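
To make the log(p) delay bound concrete, the following C sketch assumes the standard binary-addressed hypercube: p = 2^d nodes are labeled 0 to p - 1, and two nodes are linked when their labels differ in exactly one bit. The function names are hypothetical; they are not taken from May's lecture or from the thesis code.

```c
#include <assert.h>
#include <limits.h>

/* Neighbor of `node` across dimension `dim`: flip one address bit.
 * (Hypothetical helper, assuming binary node labels.) */
unsigned hypercube_neighbor(unsigned node, unsigned dim)
{
    assert(dim < sizeof node * CHAR_BIT);
    return node ^ (1u << dim);
}

/* Minimum number of hops between nodes a and b: the number of
 * address bits in which they differ (their Hamming distance). */
unsigned hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned hops = 0;

    while (diff != 0) {
        hops += diff & 1u;   /* count one hop per differing bit */
        diff >>= 1;
    }
    return hops;
}
```

Under this assumption no two nodes are more than d = log2(p) hops apart, which is one way to obtain the bounded delay that May asks for.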

4. Projections



Kenneth Wilson makes a credible case that parallel computing is
here to stay. His reasoning is based upon the fact that mass production and heavy
competition are proven ingredients in keeping the cost of chips low. Rather than 
summarize, I will quote his conclusion [Ref. 12: p. 185]: 

Today a single processing unit costing millions of dollars can still be cost- 
effective but I don’t think this can last very long, over a period of time (I cannot 
estimate how many years) it seems likely that the maximum price of a cost-effective 
processor will plunge to one hundred thousand dollars, to ten thousand dollars, to 
???. I cannot estimate the ultimate equilibrium price at which this plunge will stop.

Meanwhile I can find no prospects that single supercomputer processors speeds 
will advance at anything like the pace at which processor costs are being reduced, 
even using Gallium Arsenide or superconducting Josephson junctions.

The result of this is inevitable — overall advances at the supercomputer level 
have to come through parallelism, namely, big increases in speed have to come from 
the simultaneous use of many processors in parallel. 



David May agrees with Wilson that increasingly complex components
and faster clock speeds are not likely avenues of advancement. This makes
parallel processing “technically attractive.” He also agrees that mass production will 
make the most effective use of design and production facilities. His conclusion: “A 
general purpose parallel architecture would allow cheap, standard multiprocessors to 
become pervasive.” [Ref. 14] 

May’s prediction for 1995 includes processors capable of 100 megaflops. 
INMOS believes strongly in the idea of balancing computation and communication, 
and May projects that node throughputs will have reached 500 megabytes per second. 
In 1995’s multiprocessor systems, he envisions teraflop performance. By 2000, May 
projects “scalable general purpose parallel computers will cover the performance 
range up to 10 n flops. Specialised parallel computers will extend this to $10^{13}$ flops.”

D. OVERVIEW



This chapter has surveyed the (relatively recent) history of computing, consid- 
ered the state-of-the-art, and made a few guesses as to the future. Additionally, it 
has introduced numerical and parallel computing. This serves as a backdrop for the 
remainder of the thesis. Chapter II expands the background on parallel processing 
and numerical methods. The latter provides a lead-in to the specific algorithms and 
theory that appear in Chapter III. Chapter IV introduces the parallel design and 
methods used in the work. A description of the environment, tools, and equipment 
appears in Chapter V. Results and conclusions appear in Chapters VI and VII. 

Appendices are provided to keep the chapters concise and focused. The ap- 
pendix material operates on both sides of that focus. Some of the material is de- 
signed to give sufficient background and the rest — code mostly — is provided for more 
in-depth study. The background material may be obvious to some readers and new 
to others. I have assumed that the reader has some knowledge of the background 
material. I do not presume that the reader will be familiar with the code. 

To simplify the discussion we must speak the same language. Appendix A 
gives the basic terms and notation used in the rest of the thesis. Next, we discuss 
the machines used to perform the work. While this is the subject of Chapter V, a 
more detailed account is reserved for Appendix B. Appendix C provides a general 
background on interconnection topologies. Emphasis is placed upon the hypercube 
connection scheme. Appendix D describes the process whereby a real-world problem 
is translated into matrix notation. Appendix E gives some information and results for 
communications performance in a hypercube. Finally, Appendix F provides listings 
for most of the code used in the research. 

II. BACKGROUND



Mathematics is the door and key to the sciences. 

— ROGER BACON 

Chapter I provided a backdrop, showing the state of scientific computing, es- 
pecially parallel and distributed forms, today. In the present chapter, the scope 
is limited to material and equipment pertaining to this research. The thesis work 
deals with methods of conjugate directions implemented upon two contemporary 
MIMD machines. The goal is to introduce the theory, machines, methods, and a few 
peripheral issues that will be helpful as background information. 

A. COMPUTING WITH REAL NUMBERS 

As illustrated in Figure 1.1, the speed of computing machinery has risen swiftly 
since the 1940s. This has often been encouraged by substantial advances in tech- 
nology. Today’s multiprocessor machines seem to be maintaining the fast-paced 
growth. Additionally — although precision is a less glamorous business than speed — 
the accuracy of machine solutions has become more standard. This section considers 
some of the principal issues of computing with finite approximations of real numbers. 

We have observed that the history of computing shows close ties to science and 
mathematics. As the design and construction of computers becomes a more spe- 
cialized business — mostly performed by electrical and computer engineers — we still 
find that many of the fundamental requirements are related to scientific problems. 
These problems typically involve mathematics and a significant amount of scientific 
computing applies numerical methods that involve real numbers. The trend in computer
(hardware and software) design is toward abstraction, but from time to time
we absolutely must understand and work with the underlying, concrete principles. 

1. Finite-Precision 

New problems are generated as the speed of computing machinery improves 
with each generation of machines. One question to be considered is, how reliable 
are the machines and the software that runs on them? This is a constant concern 
in computing. Many scientific problems involve continuous phenomena in the real 
world. Accordingly, we like to be able to represent the real numbers, $\mathbb{R}$, within the
machine. But, lacking infinite storage, this is impossible. There have been several 
more-or-less reasonable ideas and implementations of approximations to the real 
numbers within the limits of computer storage. Of these, the floating-point concept 
of storage and arithmetic enjoys the most widespread use. 

The Institute of Electrical and Electronics Engineers (IEEE) has established 
the principal standards for floating-point representations and arithmetic. These 
standards make machine arithmetic more predictable. Surprisingly, while they are
implemented in much of today’s computing hardware, the standards are not widely
understood by practitioners. As a result, software and applications are sometimes
written in ignorance of them.
The title of David Goldberg’s paper [Ref. 15] speaks volumes: “What Every Com- 
puter Scientist Should Know About Floating-Point Arithmetic.” Goldberg is also 
responsible for several other contributions describing floating-point arithmetic and 
the IEEE standards. Appendix A of Hennessy and Patterson’s book on architec- 
ture [Ref. 3] is such a contribution. He gives a very useful description of the IEEE 
standards and instruction on how to perform arithmetic operations on machines that 
adhere to the IEEE standards. 

2. IEEE 754



Of the four precisions specified by the IEEE 754-1985 standard, this thesis
uses the double precision format most often (to approximate real numbers), so it
will receive the most attention. In the C programming language, these numbers
correspond to the type double. They are floating-point values stored in eight bytes
(64 bits). The storage representation is illustrated as three components: one sign bit,
s; an 11-bit exponent, e; and a 52-bit fraction, f. Figure 2.1 shows an example. We
say that e is a biased exponent. Both negative and positive exponents are stored using
a range of positive binary numbers biased about (nearly) the middle, so that the true
exponent is e - 1023. Significand or mantissa is the name given to the number (1.f).
The fraction is a packed form of the significand. This means that the leading one of
the significand is implicit. A number stored this way is called a normalized
number. [Ref. 16]

All IEEE floating-point numbers are normalized except for the special
representations when e = 00000000000 = 0 or e = 11111111111 = 2047. When e = 0, the
value is either zero or a denormalized (or subnormal) number; when e = 2047, it is
an infinity or a NaN (not a number). Only the fraction, f, of a normalized
number is stored [Ref. 3: p. A-14]. Figure 2.1 shows a representation of the
floating-point number, x = 7.0. First, x is shown as it would be defined in a C
program. The C address-of operator, &, is used to indicate the address of x in
memory. That is, somewhere (namely &x) in memory, there are eight contiguous bytes
that hold a floating-point representation of x and (for illustration purposes) we can
imagine the IEEE 754 double-precision representation of x as Figure 2.1 indicates.

A standard, such as IEEE 754 (and the lesser-known IEEE 854), is not a 
panacea for the finite-precision problem but it lends tremendous support to those 
who would scientifically deal with the problems of finite-precision arithmetic. Pro- 
grams given in the files num.sys.h and num.sys.c (in Appendix F) are of interest 
to those who would explore further. The programs can demonstrate that the actual 

double x = 7.0;

&x:   s = 0
      e = 10000000001 (binary) = 1025
      f = 1100000000000000000000000000000000000000000000000000 (binary) = .11_2

Interpretation:

$$
\begin{aligned}
x &= (-1)^s \times 1.f_2 \times 2^{e - 1023} \\
  &= (-1)^0 \times 1.11_2 \times 2^{1025 - 1023} \\
  &= 1.11_2 \times 4 \\
  &= 111_2 \\
  &= 7
\end{aligned}
$$

Figure 2.1: IEEE 754 Representation: Double Precision



order and location of bits in memory may not match the representation of Fig- 
ure 2.1. This reflects practicalities concerning storage and transmission of bytes at 
a very low level in the machine. It is perfectly reasonable (and easier) to use the 
common abstraction of Figure 2.1 regardless of machine implementation. 
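
The abstraction of Figure 2.1 can be checked directly. The following is a minimal sketch, not the num.sys code of Appendix F: it assumes that the C type double is the 64-bit IEEE 754 format, and the type and function names (ieee_fields, ieee_decode) are hypothetical. Copying the value into a 64-bit integer, rather than inspecting individual bytes, keeps the sketch independent of the byte order discussed above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE 754 double. */
typedef struct {
    unsigned sign;       /* s: 1 bit                          */
    unsigned exponent;   /* e: 11 bits, biased by 1023        */
    uint64_t fraction;   /* f: 52 bits, leading one implicit  */
} ieee_fields;

/* Split a double into s, e, and f.
 * Assumption: double is the 64-bit IEEE 754 double format. */
ieee_fields ieee_decode(double x)
{
    uint64_t bits;
    ieee_fields out;

    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* copy the 8 bytes of x */

    out.sign     = (unsigned)(bits >> 63);
    out.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    out.fraction = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 bits */
    return out;
}
```

For x = 7.0 this yields s = 0, e = 1025, and a fraction whose two leading bits are set, in agreement with Figure 2.1.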



B. NUMERICAL ISSUES 
1. The Need 



Consider the problem of determining the area under a bounded function
f(x) over a closed interval [a, b]. Numerical quadrature (integration) rules such as
the Trapezoidal Rule or Simpson’s Rule are used to arrive at an approximating (or
Riemann) sum of many smaller areas within the region. Numerical methods are
often used to approximate the solution to a problem. This is no trivial task. To
solve it (numerically) by anything other than accident, one must first understand 
the theory and analytical approach. Next, the problem can be translated into an 
algorithm (a plan — usually mathematical in nature — for solving the problem step- 
by-step) which can, in turn, be translated into the sort of language that a machine 
understands. 
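
As an illustration of the last step, here is a short C sketch of the composite Trapezoidal Rule. It is a textbook form rather than code from this research; the function names and the sample integrand f(x) = x^2 are assumptions chosen for the example.

```c
#include <assert.h>

/* Composite Trapezoidal Rule on [a, b] with n equal subintervals. */
double trapezoid(double (*f)(double), double a, double b, int n)
{
    double h, sum;
    int i;

    assert(n > 0);
    h = (b - a) / n;
    sum = 0.5 * (f(a) + f(b));     /* endpoints carry half weight    */
    for (i = 1; i < n; i++)
        sum += f(a + i * h);       /* interior points: full weight   */
    return h * sum;
}

/* Sample integrand (an assumption for illustration): f(x) = x^2. */
double square(double x)
{
    return x * x;
}
```

Calling trapezoid(square, 0.0, 1.0, 1000) approximates the exact area 1/3. The remaining difference comes from replacing the curve with straight segments, and it shrinks as n grows.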

This is a relatively simple approximation problem compared to the problem 
of finding the solution to a system of 500 equations in 500 unknowns. Consider the 
(perhaps more realistic) problem of using numerical linear algebra to solve an elliptic 
partial differential equation like the one presented in Appendix D. Numerical con- 
cerns abound in problems such as these. Additionally, many problems in numerical 
linear algebra have time complexities of $\Theta(n^2)$ or $O(n^3)$ and storage requirements of
$O(n^2)$, so speed is essential. (Appendix A reviews the complexity notation such as
big-Oh and big-Theta). 

2. Errors and Blunders 

A clear understanding of the differences between errors and blunders is
important, since recognizing the source of an error is prerequisite to eliminating or
reducing it. The terms are introduced in [Ref. 17: p. 1]:

Blunders result from fallibility, errors from finitude. Blunders will not be 
considered here to any extent. There are fairly obvious ways to guard against them, 
and their effect, when they occur, can be gross, insignificant, or anywhere in be- 
tween. Generally the sources of error other than blunders will leave a limited range 
of uncertainty, and generally this can be reduced, if necessary, by additional labor. 

It is important to be able to estimate the extent of the range of uncertainty. 

— ALSTON S. HOUSEHOLDER 

3. The Issues



To anticipate — or even troubleshoot — error we must know from whence it 
comes. In [Ref. 17: p. 2], Alston Householder lists the four sources of error that 
were set forth by John von Neumann and Herman Goldstine: 

• Mathematical formulations are seldom exactly descriptive of any real situation, 
but only of more or less idealized models. Perfect gases and material points do 
not exist. 

• Most mathematical formulations contain parameters, such as lengths, times, 
masses, temperatures, etc., whose values can be had only from measurement. 
Such measurements may be accurate to within 1, 0.1, or 0.01 percent, or better, 
but however small the limit of error, it is not zero. 

• Many mathematical equations have solutions that can be constructed only in 
the sense that an infinite process can be described whose limit is the solution 
in question. By definition the infinite process cannot be completed. So one 
must stop with some term in the sequence, accepting this as the adequate 
approximation to the required solution. This results in a type of error called 
the truncation error. 

• The decimal representation of a number is made by writing a sequence of digits 
to the left, and one to the right, of an origin which is marked by a decimal 
point. The digits to the left of the decimal point are finite in number and 
are understood to represent coefficients of decreasing powers of 10. In digital 
computation only a finite number of these digits can be taken account of. The 
error due to dropping the others is called the round-off error. . . . 
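
The last two sources can be seen in a few lines of C. This is only an illustrative sketch with hypothetical function names. The first function stops an infinite series for e after n terms (truncation error); the second adds the decimal fraction 0.1, which has no finite binary expansion, so each addition may be rounded (round-off error).

```c
#include <assert.h>

/* Truncation error: approximate e = 1/0! + 1/1! + 1/2! + ...
 * by the first n terms of the series. */
double exp1_partial(int n)
{
    double term = 1.0;   /* current term, 1/k! */
    double sum = 0.0;
    int k;

    assert(n > 0);
    for (k = 0; k < n; k++) {
        sum += term;
        term /= (k + 1);   /* next term: 1/(k+1)! */
    }
    return sum;
}

/* Round-off error: add 0.1 to an accumulator n times. The stored
 * value of 0.1 is only the nearest double, and each sum is rounded. */
double repeated_tenth(int n)
{
    double sum = 0.0;
    int k;

    for (k = 0; k < n; k++)
        sum += 0.1;
    return sum;
}
```

Taking more terms in exp1_partial reduces the truncation error, which is the "additional labor" of Householder's remark. The result of repeated_tenth(10) is close to 1.0, but on binary floating-point hardware it is not guaranteed to equal 1.0 exactly.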

C. MACHINE METHODS



We would like to somehow characterize the techniques that make a problem-solving
method “good.” The abilities of machines and people are distinct enough that
we should not always expect an algorithm for machine solution to mirror the pencil- 
and-paper method of an individual. Hestenes and Stiefel make this distinction, defin- 
ing a hand method as “one in which a desk calculator may be used” and a machine 
method as “one in which sequence-controlled machines are used.” [Ref. 18: p. 409] 
Further, in the same reference, they list the following characteristics that a good 
machine method exhibits: 



(1) The method should be simple, composed of a repetition of elementary
routines requiring a minimum of storage space.

(2) The method should insure rapid convergence if the number of steps required
for the solution is infinite. A method which — if no rounding-off errors
occur — will yield the solution in a finite number of steps is to be preferred.

(3) The procedure should be stable with respect to rounding-off errors. If
needed, a subroutine should be available to insure this stability. It should be possible
to diminish rounding-off errors by a repetition of the same routine, starting with
the previous result as the new estimate of the solution.

(4) Each step should give information about the solution and should yield a
new and better estimate than the previous one.

(5) As many of the original data as possible should be used during each step
of the routine. Special properties of the given linear system — such as having many
vanishing coefficients — should be preserved. (For example, in the Gauss elimination
special properties of this type may be destroyed.)

D. CONJUGATE DIRECTIONS 



Hestenes and Stiefel describe the method of conjugate directions (CD). This is
a general approach to solving systems of linear equations that uses direction vectors,
$p_0, p_1, \ldots$, to determine how the search for a solution should proceed from step
to step. When the method for determining these vectors is defined, CD becomes a
specific method. There are at least two of these specific methods within CD that
are especially suited to computer implementation: Gauss factorization (GF) and the
method of conjugate gradients (CG). [Ref. 18: p. 412]

The term conjugate is clearly an important one for these methods. Given a
matrix $A \in \mathbb{R}^{n \times n}$ that is symmetric, we say that two vectors x and y are conjugate
if

$$
x^T A y = (Ax)^T y = 0. \tag{2.1}
$$



There is an alternative term that emphasizes the role of A in this definition. We also 
say that x and y are A-orthogonal. [Ref. 18: p. 410] 
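
The definition is easy to test numerically. The sketch below computes the product x^T A y for a square matrix stored row by row in a flat array; the function name and the storage layout are assumptions made for this illustration, not the conventions of the thesis code.

```c
#include <assert.h>
#include <stddef.h>

/* Compute x^T A y, where A is n-by-n and stored row-major
 * in a flat array (an assumed layout for this sketch). */
double bilinear(const double *A, const double *x, const double *y, size_t n)
{
    double total = 0.0;
    size_t i, j;

    assert(A != NULL && x != NULL && y != NULL);
    for (i = 0; i < n; i++) {
        double row = 0.0;
        for (j = 0; j < n; j++)
            row += A[i * n + j] * y[j];   /* i-th component of Ay */
        total += x[i] * row;
    }
    return total;
}
```

For example, with the symmetric matrix A = [2 1; 1 2], the vectors x = (1, 0) and y = (1, -2) give x^T A y = 0, so they are conjugate with respect to A even though they are not orthogonal in the ordinary sense (their dot product is 1).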

The method of conjugate gradients chooses its direction vectors, $p_i$, to be mutually
conjugate ($p_i^T A p_j = 0$ whenever $i \neq j$) and in such a manner that $p_{i+1}$ depends
upon $p_i$. (A specific formula is given near the end of Chapter III.) The Gauss
factorization chooses $p_i = e_i$, the $i$th axis vector. [Ref. 18: pp. 412, 425-427]

In this research, the Gauss method gets almost all of the attention, but the 
method of conjugate gradients receives a short overview near the end of Chapter III. 
The theory of conjugate directions is not at all trivial, and the ties of Gauss and 
conjugate gradients to conjugate directions are fairly deep. These issues are covered 
in the work of Hestenes and Stiefel [Ref. 18]. This thesis develops the Gauss method 
from an implementation standpoint. 



E. PARALLEL PROCESSING 



The field of parallel and distributed computing is a relatively new one. In 
one sense, it is quite natural. We perform work in parallel every day. In fact, a 
manager-worker notion is a very useful means to understand the issues of this field. 
The programs developed in this research involve a host or manager and nodes or 
workers. This is often called the workfarm approach. 

The principal “problem” in parallel computing is communication. Appendix C
relates some of the considerations. Of course, there are other concerns as well: load
balancing, problem size (granularity), and so on. These issues, as they apply to
this research, are discussed in Chapter IV.

The bottom line, after all of the design and implementation work, is performance.
With multicomputers, as in a workfarm, we are after efficiency so that more
computing can be done in a shorter time and for less money. Bell is even more 
specific. He believes the multicomputer must offer two key facilities to become es- 
tablished [Ref. 6: p. 1097]: 

• Power that is not otherwise available. 

• Performance for a price that is “at least an order of magnitude cheaper than 
traditional supercomputers.” 

In Chapter VI, we consider results obtained upon two contemporary parallel 
machines. This information helps us to evaluate the potential of MIMD architectures 
in terms of Bell's criteria. 

F. SPEEDUP 

The terms speedup and efficiency, defined in Appendix A, capture most of the
interest when we talk about the potential of parallel computing. The principal reason 
for choosing a multicomputer over a single computer is speed. Therefore, we are most 
interested in knowing what kind of speed we can obtain from a multiprocessor system. 
Bell’s comments on price are germane as well. 

Speedup and efficiency are both machine dependent and problem dependent.
Some problems should not be executed on a parallel machine! Suppose, for instance,
that part of a problem must be performed sequentially. Amdahl’s law is a well-known
attempt to characterize this problem. Amdahl stated that speedup on P processors,
S, is limited in the following manner:

$$
S \le \frac{1}{f + (1 - f)/P} \tag{2.2}
$$

where f is “the fraction of operations in a computation that must be performed
sequentially, where $0 \le f \le 1$” [Ref. 19: p. 19]. Taking the speedup, S, at the
bound in (2.2), we see that

$$
\lim_{P \to \infty} S = \frac{1}{f}. \tag{2.3}
$$

Figure 2.2 shows how this limit begins to take effect as the number of processors,
P, is increased from 1 to 500. The figure is based on Amdahl’s law (2.2) with
sequential percentages, f, of 5%, 10%, and 25%.
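
The numbers behind such a plot are simple to reproduce. The sketch below assumes the usual form of Amdahl's bound, S <= 1/(f + (1 - f)/P), with f the sequential fraction and P the number of processors; the function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on speedup from Amdahl's law:
 * 1 / (f + (1 - f)/P), for sequential fraction f and P processors. */
double amdahl_bound(double f, int P)
{
    assert(P >= 1);
    assert(f >= 0.0 && f <= 1.0);
    return 1.0 / (f + (1.0 - f) / P);
}
```

With f = 0.05 the bound at P = 500 is roughly 19.3, already close to the limiting value 1/f = 20. With f = 0.25 no number of processors can deliver a speedup of 4.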

We can see that Amdahl’s law has some very discouraging news for so-called
massively parallel computing. The massive part of the term is loosely defined,
apparently meaning “many” processors. But Amdahl’s law may be based upon a faulty
assumption [Ref. 20]. Consider the following reasoning about time. Let P be the
number of processors. Let s be the time
required to execute the serial portions of a program on a serial processor and let
p be the amount of time required to complete the parallel work on the same serial
processor. Using this notation, and normalizing (s + p = 1), Amdahl’s law can be
restated

$$
S = \frac{s + p}{s + p/P} = \frac{1}{s + p/P}. \tag{2.4}
$$

Then, if we consider the case P = 1,024 with s < 10%, we see in Figure 2.3 that
speedup is severely restricted.

Figure 2.2: Amdahl’s Law (1 < P < 500)

Figure 2.3: Amdahl’s Law (P = 1024)

G. SCALED SPEEDUP



These problems with the usual notion of speedup led Gustafson, Montry, and 
Benner to question the validity of Amdahl’s assumptions [Ref. 20: p. 3]: 

The expression and graph are based on the implicit assumption that p is
independent of P. However, one does not generally take a fixed size problem and
run it on various numbers of processors; in practice, a scientific computing problem
scales with the available processing power. The fixed quantity is not the problem
size but rather the amount of time a user is willing to wait for an answer; when
given more computing power, the user expands the problem (more spatial variables,
for example) to use the available hardware resources.

As a first approximation, we have found that it is the parallel part of a program
that scales with the problem size. Times for program loading, serial bottlenecks,
and I/O that make up the s component of the application do not scale with
the problem size. When we double the number of processors, we double the number
of spatial variables in a physical simulation. As a first approximation, the amount
of work that can be done in parallel varies linearly with the number of processors.



Based upon this analysis, they present the notion of scaled speedup. They let
s' and p' represent the serial and parallel time spent on a parallel system (the inverse
of Amdahl’s method), so that s' + p' = 1 and a uniprocessor requires time s' + p'P to
perform the task. With these definitions, they define scaled speedup, S', to be

$$
S' = \frac{s' + p'P}{s' + p'} = P + (1 - P)s'. \tag{2.5}
$$

If we consider the same range of serial fractions as we did in Figure 2.3, we see that
scaled speedup is much better than the usual speedup. Figure 2.4 shows the plot of
scaled speedup.
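
The contrast between the two measures can be sketched in C. The first function assumes the fixed-size form 1/(s + p/P) with s + p = 1, and the second assumes the scaled form P + (1 - P)s' with s' + p' = 1. The function names are hypothetical. Note that s is measured on a serial processor and s' on the parallel system, so feeding both functions the same fraction is a comparison of the two models, not of two runs of one program.

```c
#include <assert.h>

/* Fixed-size (Amdahl) speedup: 1 / (s + p/P), with s + p = 1. */
double fixed_speedup(double s, int P)
{
    assert(P >= 1);
    assert(s >= 0.0 && s <= 1.0);
    return 1.0 / (s + (1.0 - s) / P);
}

/* Scaled speedup: P + (1 - P) s', with s' + p' = 1. */
double scaled_speedup(double s_prime, int P)
{
    assert(P >= 1);
    assert(s_prime >= 0.0 && s_prime <= 1.0);
    return P + (1.0 - P) * s_prime;
}
```

For P = 1024 and a serial fraction of 10%, the fixed-size model gives a speedup just under 10, while the scaled model gives about 921.7.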

H. SUMMARY 

This chapter considers the background necessary to develop the algorithms 
(Chapters III and IV) and implement them (Chapter V). Algorithms are described 
as sequential plans first (Chapter III). The Gauss factorization algorithm is given 

Figure 2.4: Scaled Speedup



in detail (Chapter III), including a discussion on the significance of pivoting. The 
method of conjugate gradients receives less attention, but a brief introduction is 
given near the end of Chapter III. The parallel considerations surveyed quickly in 
this chapter receive more attention in Chapter IV. 

III. THEORY



No human investigation can be called real science if it cannot be demonstrated 

mathematically. 



— LEONARDO DA VINCI (1452-1519) 



A. SCOPE 

The goal of this research is to demonstrate a parallel method for solving a 
system of linear equations. The implementation targets two contemporary MIMD 
architectures: the Intel iPSC/2 and networks of INMOS transputers. There are many 
methods for solving linear systems. This work concentrates primarily upon Gauss 
factorization (GF), but the method of conjugate gradients (CG) is also introduced. 
Regrettably, CG is not developed due to time constraints (the derivation is not 
trivial). This does not imply that Gauss factorization is superior, nor that it possesses 
greater potential for parallel solution. Indeed, Hestenes and Stiefel preferred CG to 
GF for a number of very good reasons [Ref. 18: p. 409]. 

As we shall see, the utility of either method is quite dependent upon the nature
of the particular problem. Consider the system of linear equations represented by

$$
Au = b. \tag{3.1}
$$

Much of the subsequent discussion applies to general, rectangular systems where
$A \in \mathbb{R}^{m \times n}$. For the examples, however, square systems ($A \in \mathbb{R}^{n \times n}$) are used. This
restriction greatly simplifies the discussion without losing much of the concept as
it applies to general systems. The Gauss process, i.e., the main part of the work,
excluding the stopping criteria and interpretation of the result, is the same in all
three cases (m < n, m = n, and m > n).

To be sure, the three cases (m < n, m = n, and m > n) correspond to fundamentally
different real-world systems, but the algorithms for each case are almost
identical. The extensions to the general case are well known; Golub and Van Loan
[Ref. 21: p. 102] give more detail, but the square case is most expedient for now.
Square systems also simplify the experimental procedure, data collection, and
analysis.

The Gauss method follows naturally from a hand method and it holds strong 
appeal to intuition. Without a pivoting strategy, however, Gauss can attempt division
by zero. There is also a more subtle issue of rounding errors within the limits of 
finite-precision arithmetic. To forestall errors of both kinds, partial and complete 
pivoting strategies are used. This chapter develops the (sequential) algorithms and 
explains the concept of pivoting. This is a sensible starting point for Chapter IV, 
where parallel versions of the algorithms are given. 
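
To show why pivoting matters before the algorithms are developed, here is a compact textbook sketch of Gaussian elimination with partial pivoting in C. It is not the algorithm of this chapter or the code of Appendix F; the function names, the row-major storage, and the return convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper (avoids a dependency on the math library). */
double absval(double v)
{
    return v < 0.0 ? -v : v;
}

/* Solve A u = b in place by Gaussian elimination with partial pivoting.
 * A is n-by-n, row-major; on success b holds the solution u.
 * Returns 0 on success, -1 if no nonzero pivot can be found. */
int gauss_solve(double *A, double *b, size_t n)
{
    size_t i, j, k;

    assert(A != NULL && b != NULL);
    for (k = 0; k < n; k++) {
        /* Partial pivoting: choose the largest |entry| in column k. */
        size_t piv = k;
        for (i = k + 1; i < n; i++)
            if (absval(A[i * n + k]) > absval(A[piv * n + k]))
                piv = i;
        if (A[piv * n + k] == 0.0)
            return -1;                        /* singular matrix */
        if (piv != k) {                       /* swap rows k and piv */
            double t;
            for (j = k; j < n; j++) {
                t = A[k * n + j];
                A[k * n + j] = A[piv * n + j];
                A[piv * n + j] = t;
            }
            t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        /* Eliminate the entries below the pivot in column k. */
        for (i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            for (j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the upper-triangular system. */
    for (k = n; k-- > 0; ) {
        double s = b[k];
        for (j = k + 1; j < n; j++)
            s -= A[k * n + j] * b[j];
        b[k] = s / A[k * n + k];
    }
    return 0;
}
```

The system with A = [0 1; 1 1] and b = (1, 2) has the solution u = (1, 1), yet elimination without a row exchange would divide by the zero in the first pivot position. Choosing the largest available entry also keeps every multiplier at most 1 in magnitude, which helps limit the growth of rounding errors.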

B. APPROACH 

There are many methods that may be applied to determine the solution of a 
system of linear equations. The methods were designed for different reasons and 
with different problems in mind, so each exhibits a unique behavior. One method 
is often preferred over another for a given problem. Ultimately, the criterion is 
performance, both in reliability and speed. The approach described here and in the 
remaining chapters seeks to “maximize performance” while retaining a reasonable 
balance of both efficiency and quality. Speed and numerical accuracy tend to oppose 
one another so we are left to choose from several options. 

A hand method introduces each algorithm. The example is small and concrete. 
Solving a small problem gives useful insights into the algorithms. Once the hand 
method is established, it is expressed in an equivalent matrix notation. A high-level 
sequential algorithm is built upon this foundation. This algorithm shows how a
machine, using a sequence of instructions, solves the problem. It also gives good es- 
timates for the problem’s time and storage complexities. The sequential-to-parallel 
transition involves enough issues to warrant separate coverage. These considerations 
appear in Chapter IV. 

In the sections that follow, Gaussian elimination is presented first. It reveals the 
background (sort of a first pass) for Gauss factorization. Once the reduction process 
is understood, we proceed to factorization. A description of the method of conjugate 
gradients is given at the end of the chapter. This method, due to Hestenes and 
Stiefel, is based upon relatively deep theory. Thus the derivations and background 
are not included. Nevertheless, a synopsis of the method is given. 

C. APPLYING THE METHODS 

A particular method is often tailored to a specific type of system. The method 
of conjugate gradients, for instance, is usually used when the matrix of coefficients, 
A, is symmetric and positive definite [Ref. 18: p. 411]. The Gauss factorization 
algorithm is equally important, but it takes quite another approach to solving this 
system. Both CG and GF lie within the broad category of methods of conjugate
directions (Chapter II). Indeed, both work in just about any case, but better
results are obtained by using the tool that fits the task at hand.

A very rough characterization of the problem can simplify algorithm selection. 
We will look for two qualities: structure and density. CG, for instance, performs 
best when applied to highly structured, sparse matrices (i.e., matrices with many zero 
entries). Systems like the sparse, symmetric, highly-structured result of Appendix D 
deserve careful solutions that do not destroy the existing zeros. Zeros are not always 
easy to come by. Gaussian elimination must expend $2n^3/3$ flops to create them.

Selecting the wrong algorithm can lead to slower execution. More importantly,
poor algorithm choice is a blunder (Chapter II). It can produce results that are
accidentally perfect, grossly incorrect, or anywhere between. Therefore, no less than
three tasks confront us:

• Characterize the problem. In systems like (3.1), attributes of the matrix of 
coefficients, A, may provide a wealth of information. 

• Understand the algorithm(s). Know the types of problem(s) it is designed for 
(and, more importantly, know why). 

• Create or select an algorithm that suits the problem. 

The sparse, highly-structured problems are not rare! Anyone who has observed 
nature knows that many natural phenomena exhibit incredible structure and sim- 
plicity. Strategies for solving the corresponding system should always seek to exploit 
these characteristics. Both sparseness and structure can reduce storage requirements 
and the number of flops required. If we know the structure in advance, there may 
be a smart way to avoid some calculations entirely or minimize the work involved. 
(Recall Hestenes and Stiefel’s characterization of a “good” machine method from
Chapter II.) Other problems, when translated into the form (3.1), exhibit a dense
matrix, $A$, with little or no apparent structure.

These two types of problems should not be handled with the same tools. As 
with many computational problems, the reasons involve the use of time and space. 
We shall see that the Gauss algorithm has time complexity $O(n^3)$ and storage
requirements $O(n^2)$. (Complexity notation appears in Appendix A.) Numbers like
these grow rapidly with $n$ and, regardless of how much memory is available, the
problem can quickly overpower the computer. A naive approach to problems of
these kinds can be expensive in terms of both storage and time. This is usually
adequate incentive to take advantage of sparseness and structure whenever possible.
When it is not possible, Gauss is a good choice.
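
To give a feeling for how quickly these quantities grow, the following minimal sketch tabulates the leading-order flop count $2n^3/3$ and the storage for an $n \times n$ dense matrix. The function name `dense_cost` and the choice of 8 bytes per entry are assumptions made for illustration only; they are not taken from the text.

```python
# Rough growth of dense Gaussian elimination cost with problem size n.
# Assumption (not from the text): each matrix entry occupies 8 bytes.

def dense_cost(n, bytes_per_entry=8):
    """Return (approximate flops, bytes of storage) for an n-by-n dense system."""
    flops = 2 * n**3 / 3                 # leading term of the elimination cost
    storage = bytes_per_entry * n**2     # one copy of the coefficient matrix
    return flops, storage

for n in (100, 1_000, 10_000):
    flops, storage = dense_cost(n)
    print(f"n = {n:>6}: about {flops:.2e} flops, {storage / 1e6:.2f} MB")
```

Multiplying $n$ by ten multiplies the storage by one hundred and the work by one thousand, which is the growth the paragraph above warns about.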

D. GAUSSIAN ELIMINATION 

Suppose that we want to solve a system of linear equations using a systematic, 
step-by-step method. We assume that the system of linear equations is given, and 
that the method must preserve the original properties of the system. That is, the 
method must be restricted to certain operations; namely: 

• Multiply an equation by a nonzero constant. 

• Interchange equations. 

• Add a multiple of one equation to another. 

The fact that the first two operations do not change the system’s properties is
evident. The third operation is legitimate also (maybe not quite so obviously) and,
computationally, it is the most significant. Now let us apply some of these operations to
a system of four equations in the four unknowns, $v_1$, $v_2$, $v_3$, and $v_4$.

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.2}
\]

Let $m$ (= 4) be the number of equations, and let $n$ (= 4) be the number of unknowns
in each equation. Additionally, let $i$ be an equation (or row) index ($1 \le i \le m$) and
let $j$ indicate a subscript of $v$ (column index) so that $1 \le j \le n$. Finally, let
$\alpha_{ij}$ be the coefficient of $v_j$ in equation $i$ (e.g., $\alpha_{12} = 3$). Suppose that
the last equation contains only one nonzero coefficient (say $\alpha_{44}$) and the third
equation has only two nonzero coefficients ($\alpha_{33}$ and $\alpha_{34}$) and so on. This
defines a triangular system (Appendix A). The triangular system is our goal because it is
easier to solve (by back substitution) than the current (square, dense) system.

Next, observe that a triangular system would result if we could eliminate every
coefficient, $\alpha_{i1}$, of $v_1$ in all equations but the first ($i > 1$), the coefficients,
$\alpha_{i2}$, of $v_2$ in the last two equations ($i > 2$), and the coefficient, $\alpha_{43}$, of
$v_3$ in the final equation. To do this, we work by stages. At stage $k$, the coefficient,
$\alpha_{kk}$, of $v_k$ in the $k$th equation is called the pivot. This term has little
significance now but is clarified later (and it plays a very important role in the examples
presented). In a particular stage, $k$, the goal is to operate upon all equations $i$ where
$i \in \{(k+1), (k+2), \ldots, m\}$ and eliminate all coefficients, $\alpha_{ik}$, of $v_k$.

1. A Hand Method

Before attempting to describe an algorithm for a machine solution, we consider
an application of Gaussian elimination (GE) by hand. Initially, let $k = 1$. In
the example system (3.2), the first ($k = 1$) pivot is the coefficient, $\alpha_{11} = 2$, of $v_1$
in the first equation. Notice that by subtracting twice the first equation from the
second, a zero is produced under the pivot (eliminating $\alpha_{21}$). Similarly, by subtracting
the first equation from the third, a zero appears as the leading coefficient in the
third equation (eliminating $\alpha_{31}$). Finally, three times the first equation subtracted
from the fourth equation eliminates the coefficient $\alpha_{41}$. Following these steps the
altered system is:

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-5v_4 &= -5 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-v_2 - 4v_3 - 6v_4 &= -17
\end{aligned}
\tag{3.3}
\]





This is called the natural reduction process [Ref. 22: p. 72]. In this particular case,
there are no changes on the right-hand side because the first equation’s right-hand
side is zero. This makes for trivial arithmetic on the right-hand side, but we should
remember to perform the arithmetic upon whole equations (including the right-hand
side) in general. The elimination is even more successful than planned.
The second equation already has zeros where we ultimately wanted them
in the fourth equation. That is, the system (3.3) would be closer to upper triangular
if we were to alter it by interchanging equations 2 and 4.



\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
v_2 + 3v_3 + 4v_4 &= 13 \\
-5v_4 &= -5
\end{aligned}
\tag{3.4}
\]



The system (3.4) is called a row permutation of (3.3). The ability to recognize 
patterns is a great advantage that human problem solvers enjoy. Therefore, taking 
advantage of our capabilities we use a rather subjective “human” pivoting strategy. 
But it is not fitting to assume that an efficient algorithm for a machine would involve 
the same sort of pattern recognition. 

The system (3.4) is nearly triangular. The pivot moves to the second equation
($k = 2$), and we focus on the coefficient, $\alpha_{22} = -1$, of $v_k = v_2$. By adding
the second equation to the third, the only nonzero coefficient remaining in the lower
triangle ($\alpha_{32}$) is eliminated. The resulting system becomes



\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
-v_2 - 4v_3 - 6v_4 &= -17 \\
-v_3 - 2v_4 &= -4 \\
-5v_4 &= -5
\end{aligned}
\tag{3.5}
\]



The system is triangular, and it is easy to solve for the unknown values, $v_i$, by back
substitution. By inspection, $v_4 = 1$. Substituting this value into the third equation,
we find that $v_3 = 2$. Substituting both values ($v_4$ and $v_3$) into the second equation
yields $v_2 = 3$. Finally, substituting the values $v_4$, $v_3$, and $v_2$ into the first equation
gives $v_1 = -11$. The solution to the system is then

\[
v =
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}
\tag{3.6}
\]
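
The back substitution just carried out by hand can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `back_substitute` is ours, the matrix is stored as a list of rows, and exact rational arithmetic (`fractions.Fraction`) is used so that the result can be compared directly with (3.6).

```python
from fractions import Fraction

def back_substitute(U, c):
    """Solve U v = c for an upper triangular U with nonzero diagonal."""
    n = len(U)
    v = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):          # last equation first
        known = sum(U[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (Fraction(c[i]) - known) / U[i][i]
    return v

# Coefficients and right-hand side of the triangular system (3.5).
U = [[2, 3, 4, 5],
     [0, -1, -4, -6],
     [0, 0, -1, -2],
     [0, 0, 0, -5]]
c = [0, -17, -4, -5]
solution = back_substitute(U, c)
print(solution)
```

The loop visits the equations from the bottom up, exactly as in the hand calculation: each step uses only values of $v$ that are already known.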
2. A Machine Method

The foregoing example illustrated the GE process as done on paper. The
system was intentionally created for easy solution by hand calculation; that is, it uses
integers, and elimination proceeds faster than in the usual case. Even this simple example
requires a few minutes to determine $v$ from the system (3.2) by hand. In Chapter
VI, we see that a machine can perform this task in (much) less than a second. For
this reason, it is worth examining an equivalent process to solve such a system
by machine.

We reenact the solution from the beginning, this time in a fashion that 
a sequence-controlled machine could perform. Until now, we have used the term 
“pivot” but have found no practical use for pivots. In this example, we begin to 
realize the utility of a pivoting strategy. We start with “no pivoting” and shift to 
the “partial pivoting” strategy. Additionally, we begin to use a more compact matrix 
notation. Appendix A describes the notation followed. 

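By the method described in Appendix A, we give the linear system (3.2)
a matrix representation that corresponds to (3.1):

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
=
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
\tag{3.7}
\]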



First, we initialize a stage counter, $k$, so that $k = 1$. The pivot in stage $k$ is $\alpha_{kk}$, on
the diagonal of $A$ ($\alpha_{11} = 2$). The immediate goal is to produce zeros beneath the
pivot, in $A(2{:}4, 1)$. A three-step process eliminates these coefficients in row order:



• Divide. Divide every element beneath the pivot by the pivot value. 



• Update. Perform arithmetic in the Gauss transform area. 



• Eliminate. Set the elements beneath the pivot equal to zero.

The first step is a division. The denominator (pivot) is $\alpha_{kk} = \alpha_{11} = 2$ so
$\alpha_{21}$ becomes the multiplier $(\alpha_{21}/2) = 2$. Similarly, let $\alpha_{31} = 1$ and let
$\alpha_{41} = 3$. Now

\[
A =
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 2 & 6 & 8 & 5 \\ 1 & 4 & 7 & 9 \\ 3 & 8 & 8 & 9 \end{bmatrix}
\tag{3.8}
\]



Next, consider everything below and to the right of the pivot. This is the Gauss
transform area, $G = A((k+1){:}m, (k+1){:}n) = A(2{:}4, 2{:}4)$. For each element in
$G$, replace the current value, $\alpha_{ij}$, with $\alpha_{ij} - (\alpha_{ik})(\alpha_{kj})$. Do the same
thing in the corresponding rows ($i > k$) of $b$, replacing $\beta_i$ with
$\beta_i - (\alpha_{ik})(\beta_k)$. We will call this
the process of performing arithmetic in (or updating) the Gauss transform area, $G$.

Finally, when the values beneath the pivot are no longer needed, eliminate 
them (set them equal to zero). The result is equivalent to the system (3.3): 



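\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 0 & 0 & -5 \\ 0 & 1 & 3 & 4 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ -5 \\ 13 \\ -17 \end{bmatrix}
\tag{3.9}
\]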

We have finished one stage of GE. We move into the next stage, $k = 2$. This time,
when we try to update $G$ we run into a very serious problem. The first step is to
divide everything underneath the pivot by the pivot value $\alpha_{kk} = \alpha_{22} = 0$. This is
the divide-by-zero problem of a “no pivoting” strategy.

During the execution of the hand example we simply moved the row to the
bottom of the system to avoid this problem. Now, we could instruct the machine
to test every element in $A(k{:}m, k{:}n)$ and interchange rows so that those with
the most leading zeros were placed at the bottom. This is problematic for several
reasons. First, it is not dependable (testing for equality of floating-point numbers
begs disaster). Secondly, even if we could identify zeros with confidence, it would
add a sorting problem to GE! We are not looking for extra work. The solution is
partial pivoting.
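
The divide, update, and eliminate steps, and the failure at the second stage, can be reproduced with a short sketch. The function name `ge_stage`, the 0-based stage index, and the use of exact fractions are assumptions made for this illustration; the text itself numbers stages from one.

```python
from fractions import Fraction

def ge_stage(A, b, k):
    """Apply stage k (0-based) of Gaussian elimination in place, no pivoting.

    Raises ZeroDivisionError if the pivot A[k][k] is zero.
    """
    n = len(A)
    pivot = A[k][k]
    if pivot == 0:
        raise ZeroDivisionError(f"zero pivot at stage {k + 1}")
    for i in range(k + 1, n):
        m = A[i][k] / pivot                 # Divide: form the multiplier
        for j in range(k + 1, n):
            A[i][j] -= m * A[k][j]          # Update the Gauss transform area
        b[i] -= m * b[k]                    # ... and the right-hand side
        A[i][k] = Fraction(0)               # Eliminate

A = [[Fraction(x) for x in row] for row in
     [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]]
b = [Fraction(x) for x in [0, -5, 13, -17]]

ge_stage(A, b, 0)            # stage k = 1 in the text's numbering: gives (3.9)
try:
    ge_stage(A, b, 1)        # stage k = 2 meets the zero pivot
    failed = False
except ZeroDivisionError:
    failed = True
print(A, b, failed)
```

With exact arithmetic the test `pivot == 0` is reliable; in floating point it is not, which is one of the reasons given above for preferring a pivoting strategy.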
3. Partial Pivoting

Partial pivoting is an application of row interchanges to eliminate (primarily)
the divide-by-zero problem. Consider the system of equations (3.1) with the
nonsingular matrix of coefficients, $A \in \mathbb{R}^{m \times n}$ (i.e., $m = n$ and the system has
exactly one solution). Suppose further that storage and arithmetic are performed in infinite
precision. (These assumptions, infinite precision and $A$ nonsingular, are essential.)

Even in this ideal situation Gauss without pivoting is dangerous because, 
as we have just seen, it may attempt to divide by zero. Proper row permutations 
completely eliminate this problem. Partial pivoting will guarantee the existence 
of n nonzero pivots for A nonsingular. In fact, if we encounter a zero pivot with 
partial pivoting, it means that A is singular [Ref. 23]. The remainder of this section 
describes the partial pivoting strategy. 

Consider stage $k$ of the GE process with $A \in \mathbb{R}^{m \times n}$. The goal is to pick
the “best” row remaining (i.e., at or below the current pivot) and install it as row
$k$, the pivot row. For reasons that are explained later, “best” shall mean the row
whose $k$th (pivot column) element is largest in absolute value. Let $s$ be the row index
for the best pivot candidate. Initially, let $s = k$ (i.e., $\alpha_{kk}$ is the first candidate).
Next, we move down the pivot column, considering all $\alpha_{ik}$ where $i > k$.

To eliminate unnecessary assignments, we replace the current candidate
with another only if $|\alpha_{ik}| > |\alpha_{sk}|$. When this occurs, we make sure that $s$ is updated
by setting it equal to $i$. After considering all elements, $\alpha_{ik}$, for $k < i \le m$, $s$ is the
index of the “best possible” pivot row. To accomplish our goal, we must perform a row
interchange. This is easy after the new pivot row has been determined. We simply
swap rows $k$ and $s$ (if $k \ne s$). Within the assumptions above, we have completely
eliminated the potential for division by zero. Now let us return to the problem at
hand.
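
The search just described can be sketched as follows. The name `partial_pivot_row` and the 0-based indices are assumptions of this illustration; the matrix is the one reached after the first stage of the machine method.

```python
def partial_pivot_row(A, k):
    """Return the 0-based index s of the pivot row for stage k (0-based)."""
    s = k                                     # first candidate: the diagonal entry
    for i in range(k + 1, len(A)):
        if abs(A[i][k]) > abs(A[s][k]):       # strict test avoids needless updates
            s = i
    return s

# Coefficients after the first stage of elimination.
A = [[2, 3, 4, 5], [0, 0, 0, -5], [0, 1, 3, 4], [0, -1, -4, -6]]
k = 1                                         # second stage, 0-based
s = partial_pivot_row(A, k)
if s != k:
    A[k], A[s] = A[s], A[k]                   # row interchange
print(s, A)
```

Because the comparison is strict, the entry $1$ in the third row is kept even though the fourth row holds an entry of equal magnitude, so rows two and three are interchanged.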
4. A Machine Method (Resumed)

Applying partial pivoting to the system (3.9), we find that the next pivot
is located at $A(3,2)$ so we must interchange rows (equations) two and three. Before
performing this step, however, let us create a vector to keep track of the row
permutations. Let $q \in \mathbb{R}^m$ be the row permutation vector. We initialize $q$ so that

\[
q =
\begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}
\tag{3.10}
\]



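and perform row interchanges in $q$ corresponding to those in $A$ so that $\psi_i$ is always
the original equation number for current equation number $i$. Thus, after performing
the row interchange, we have

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & -1 & -4 & -6 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -5 \\ -17 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}
\tag{3.11}
\]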

Notice that $\psi_3 = 2$ indicates that the third equation in (3.11) was the second equation
in the original system (3.7). Now, since $\alpha_{32} = 0$, no arithmetic is required in the
third row. In row four, the arithmetic will be equivalent to the notion of adding (the
current) equation two to equation four. The result is

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & 0 & -5 \\ 0 & 0 & -1 & -2 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -5 \\ -4 \end{bmatrix}
\tag{3.12}
\]



When we move the pivot index to the third equation ($k = 3$), we notice that $\alpha_{33} = 0$.
The divide-by-zero problem has resurfaced. Once again, we pivot, swapping rows
three and four. After this, we have

\[
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 0 & 1 & 3 & 4 \\ 0 & 0 & -1 & -2 \\ 0 & 0 & 0 & -5 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 13 \\ -4 \\ -5 \end{bmatrix},
\qquad
q = \begin{bmatrix} 1 \\ 3 \\ 4 \\ 2 \end{bmatrix}
\tag{3.13}
\]

The zero beneath the final pivot obviates the need for further arithmetic. The triangular
system (3.13), found by our machine method, does not look like the system (3.5)
from the hand method because we did not perform the same row interchanges. If we
had maintained a row permutation vector, $q$, for the hand method we would have
noticed that

\[
q = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 2 \end{bmatrix}
\tag{3.14}
\]

Of course, back substitution for the final (triangular) machine system (3.13) yields
the same solution

\[
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}
\tag{3.15}
\]

as that of the hand method. Thus, even though we used different permutation
schemes, the “pivots” in both cases were always nonzero and the solutions were the
same. This is not surprising, since $A$ is nonsingular and row permutation is merely
the practice of interchanging equations.
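
A compact sketch of the whole machine method, with the permutation vector $q$ maintained alongside $A$ and $b$, is given below. It is an illustration under stated assumptions: the name `solve_partial_pivoting` is ours, exact fractions stand in for the infinite-precision assumption, and, unlike the walk-through above (which used no pivoting in stage one), the sketch pivots at every stage. Its row order therefore differs from (3.13), while the solution is the same.

```python
from fractions import Fraction

def solve_partial_pivoting(A, b):
    """Solve A v = b by GE with partial pivoting; return (v, q)."""
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]   # exact working copies
    b = [Fraction(x) for x in b]
    q = list(range(1, n + 1))                       # original equation numbers
    for k in range(n):
        # max() keeps the first of equal candidates, like the strict ">" test
        s = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[s][k] == 0:
            raise ValueError("zero pivot: A is singular")
        if s != k:                                  # interchange rows k and s
            A[k], A[s] = A[s], A[k]
            b[k], b[s] = b[s], b[k]
            q[k], q[s] = q[s], q[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    v = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):                  # back substitution
        tail = sum(A[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (b[i] - tail) / A[i][i]
    return v, q

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
b = [0, -5, 13, -17]
v, q = solve_partial_pivoting(A, b)
print(v, q)
```

This is a third permutation scheme for the same system, and it again reproduces the solution (3.15), in keeping with the remark that row permutation is merely the practice of interchanging equations.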

Let us review first the process and then the theory of Gaussian elimination.
The GE process performs a systematic elimination of the lower (in our example)
triangle of a matrix of coefficients, $A$. Arithmetic operations are performed upon
entire equations at the same time (including the right-hand side, $b$). In other words,
during stage $k$ of the process, arithmetic operations are performed upon (portions of)
all rows $i$ ($i > k$) of $A$ and upon all elements (rows) $\beta_i$ (for $i > k$) of the right-hand
side, $b$. The process depends upon both $A$ and $b$ and both of them can be changed
substantially.

The idea behind Gaussian elimination is that general square systems are
difficult to solve, but triangular systems are easy. The goal is to transform a general
matrix $A$ into triangular form, performing legitimate arithmetic upon entire
equations (including the right-hand sides). Reduction to triangular form costs $2n^3/3$
flops. Once $A$ is reduced to triangular form, back substitution yields a solution for
the unknown, $v$, in $n^2$ flops. Thus GE solves a general, dense, square system of $n$
equations in $n$ unknowns by the application of $2n^3/3 + n^2$ flops [Ref. 21: pp. 88, 97].
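
The leading terms quoted above can be checked by simply counting operations. The sketch below assumes one particular counting convention (every division, multiplication, and subtraction counts as one flop, one right-hand side, no pivoting); the cited reference may treat lower-order terms differently, and the name `ge_flops` is hypothetical.

```python
def ge_flops(n):
    """Count flops for elimination and back substitution on a dense n-by-n system."""
    elim = 0
    for k in range(1, n):
        rows = n - k                  # rows beneath the pivot in stage k
        elim += rows                  # one division per multiplier
        elim += 2 * rows * rows       # multiply and subtract in the Gauss transform area
        elim += 2 * rows              # multiply and subtract on the right-hand side
    back = 0
    for i in range(1, n + 1):
        back += 2 * (n - i) + 1       # multiply/subtract known terms, then one division
    return elim, back

for n in (4, 100, 1000):
    elim, back = ge_flops(n)
    print(n, elim, back, elim / (2 * n**3 / 3))
```

Under this convention back substitution costs exactly $n^2$ flops, and the ratio of the elimination count to $2n^3/3$ approaches one as $n$ grows, since the remaining terms are only of order $n^2$.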

E. GAUSS FACTORIZATION 

Gauss factorization (GF) is a well-known method for solving linear systems 
like (3.1) that (simultaneously) factors A. GF has strong ties to the GE process. 
Those ties will become evident as we develop the same example over again, this time 
using the GF bookkeeping and method. GF holds several major advantages over GE. 
Among these: $A$ is recoverable (the process does not destroy it) and the process is
independent of the right-hand side, $b$. In fact, $b$ is not used in the factoring process.

1. Complete Pivoting 

The complete pivoting strategy will be applied in this example. There is no
special significance behind the introduction of complete pivoting with the GF process.
Either strategy can be used with GE or GF. (A “no pivoting” strategy is also
available, but it is not generally acceptable for serious problems.) The complete
strategy is a straightforward extension of the partial strategy, so introducing partial
pivoting first was practical.

With complete pivoting, row interchanges are still allowed, but so are column
interchanges. We will continue to use $q \in \mathbb{R}^m$ for row interchange bookkeeping.
The vector $p \in \mathbb{R}^n$, similarly, will maintain the column permutation information. We
search not just the pivot column, but the entire Gauss transform area, for the next
pivot. This takes longer but generally produces better solutions. The numerical
differences between partial and complete pivoting involve some difficult error analysis.
These issues will be addressed briefly after we complete the examples.

2. Example

Now the GF process is demonstrated. We start with the same system of
four equations in four unknowns:

\[
\begin{aligned}
2v_1 + 3v_2 + 4v_3 + 5v_4 &= 0 \\
4v_1 + 6v_2 + 8v_3 + 5v_4 &= -5 \\
2v_1 + 4v_2 + 7v_3 + 9v_4 &= 13 \\
6v_1 + 8v_2 + 8v_3 + 9v_4 &= -17
\end{aligned}
\tag{3.16}
\]

and proceed immediately to the matrix of coefficients (the factoring part of GF
concerns itself with $A$ only).

\[
A =
\begin{bmatrix} 2 & 3 & 4 & 5 \\ 4 & 6 & 8 & 5 \\ 2 & 4 & 7 & 9 \\ 6 & 8 & 8 & 9 \end{bmatrix}
\tag{3.17}
\]



a. Stage Zero 

For the initial stage, $k = 0$, let the Gauss transform area be $G = A$.
The sole purpose of stage zero is to find the
first pivot. Initially, we guess that the pivot is $\alpha_{11}$, located at $A(1,1)$, the upper
left-hand corner of $G$. (This is the position where the new pivot will be installed.)
Accordingly, we set row and column indices, $s = 1$ and $t = 1$, to keep track of the
best pivot candidate.

Indices $s$ and $t$ are changed only when we find a superior candidate for
the pivot. To begin the column-by-column search for the pivot we move through the
columns in order from left to right and through each column in a top-to-bottom
manner. When we have considered every element in $G$, we know that the next pivot
is currently situated at $A(s,t)$.

For the current example, as we move down the first column of $G$, the
values of $s$ and $t$ are adjusted twice. A better pivot candidate is found, first at $A(2,1)$,
and next at $A(4,1)$. The indices are adjusted again in the last row of column two,
where the value, 8, is larger than the value of the current candidate, 6. Column
three has no candidates larger than 8, so we do not adjust the indices again until we
find the 9 at $A(3,4)$. Thus $s = 3$ and $t = 4$ have located the next pivot according
to a complete pivoting strategy. This accomplishes the goal of stage zero. Now we
specify the process for each of the remaining stages.
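
The column-by-column search of stage zero can be written as a short sketch. The function name `complete_pivot` and the 0-based indexing are assumptions of this illustration; the result is converted back to the 1-based indices used in the text.

```python
def complete_pivot(A, k):
    """Search A[k:, k:] column by column; return 0-based (s, t) of the pivot."""
    m, n = len(A), len(A[0])
    s, t = k, k                               # first guess: upper left corner of G
    for j in range(k, n):                     # columns, left to right
        for i in range(k, m):                 # each column, top to bottom
            if abs(A[i][j]) > abs(A[s][t]):   # only a strictly better candidate wins
                s, t = i, j
    return s, t

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
s, t = complete_pivot(A, 0)
print(s + 1, t + 1)                           # 1-based pivot position
```

The strict comparison is what makes the search settle on the 9 in row three rather than the equal entry in row four, matching the walk-through above.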

b. Outline of the GF Process 

For each stage, k, of GF, we shall perform the following steps: 

• Locate the pivot according to a pivoting strategy (none, partial, or complete). 
If complete pivoting is used, search all of G for the next pivot. 

• Increment the pivot index, k. 

• Perform any row and/or column permutations that are required to move the 
pivot into the position A(k,k). Update p and q accordingly. 

• Divide every element beneath the pivot by the pivot value. 

• Redefine the Gauss transform area so that $G = A((k+1){:}m, (k+1){:}n)$.

• Perform the appropriate arithmetic in G. 

Let us return to the example and exercise the process. 

c. Stage One 

Since stage zero has already located the first pivot, the first step of
section b is not necessary in this stage. We increment $k$ (to $k = 1$) and install the
pivot $A(3,4)$ at $A(k,k) = A(1,1)$. This means that rows 1 and 3 must be swapped.
Columns 1 and 4 must be swapped in addition. The permutation vectors, $p$ and $q$,
record the interchanges.
After interchanging rows and columns, we have

\[
A = \begin{bmatrix} 9 & 4 & 7 & 2 \\ 5 & 6 & 8 & 4 \\ 5 & 3 & 4 & 2 \\ 9 & 8 & 8 & 6 \end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 2 \\ 3 \\ 1 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}
\tag{3.18}
\]



Now we perform the division beneath the pivot, producing the multipliers in the
lower three rows in the leftmost column of $A$. When this is done, we perform the
arithmetic in $G = A((k+1){:}m, (k+1){:}n) = A(2{:}4, 2{:}4)$. For GF, we do not
replace the multipliers with zeros. We shall find that the multipliers are very useful
in the end. The result is

\[
A = \begin{bmatrix}
9 & 4 & 7 & 2 \\
5/9 & 34/9 & 37/9 & 26/9 \\
5/9 & 7/9 & 1/9 & 8/9 \\
1 & 4 & 1 & 4
\end{bmatrix}
\tag{3.19}
\]



Next, with $G$ being the lower right ($3 \times 3$) block of $A$, we search $G$ for the next pivot
and find that $A(s,t) = A(2,3)$ holds (37/9), the largest second pivot candidate.



d. Stage Two 



We increment the stage counter ($k = 2$), so that it points to the new
pivot location, $A(2,2)$. Since $s = k$, we know that no row interchange is necessary
and $q$ will not change. We must, however, swap columns $k = 2$ and $t = 3$. The result
is:

\[
A = \begin{bmatrix}
9 & 7 & 4 & 2 \\
5/9 & 37/9 & 34/9 & 26/9 \\
5/9 & 1/9 & 7/9 & 8/9 \\
1 & 1 & 4 & 4
\end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 4 \end{bmatrix}
\tag{3.20}
\]



Once again, we divide everything under the pivot by the value of the pivot and 
update G. This yields 



\[
A = \begin{bmatrix}
9 & 7 & 4 & 2 \\
5/9 & 37/9 & 34/9 & 26/9 \\
5/9 & 1/37 & 25/37 & 30/37 \\
1 & 9/37 & 114/37 & 122/37
\end{bmatrix}
\tag{3.21}
\]

e. Stage Three

Now $G$ becomes the ($2 \times 2$) lower right block of $A$ and the next pivot
(122/37) is found at $A(s,t) = A(4,4)$. Since $k = 3$ we must interchange rows 3 and
4 as well as columns 3 and 4. The result of the permutation is

\[
A = \begin{bmatrix}
9 & 7 & 2 & 4 \\
5/9 & 37/9 & 26/9 & 34/9 \\
1 & 9/37 & 122/37 & 114/37 \\
5/9 & 1/37 & 30/37 & 25/37
\end{bmatrix},
\qquad
p = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix},
\qquad
q = \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}
\tag{3.22}
\]



Then, dividing at the bottom of the pivot column and updating G, we have 



\[
A = \begin{bmatrix}
9 & 7 & 2 & 4 \\
5/9 & 37/9 & 26/9 & 34/9 \\
1 & 9/37 & 122/37 & 114/37 \\
5/9 & 1/37 & 15/61 & -15/183
\end{bmatrix}
\tag{3.23}
\]



f. Stage Four 



The final stage, where $k = 4 = \min(m, n)$, is always trivial. We need
only to verify that $\alpha_{44}$ is nonzero. This tells us that, indeed, $A$ is nonsingular. There
is no arithmetic to perform, so (3.23) is the final, factored, copy of $A$.
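
The stages worked above can be collected into one short sketch of GF with complete pivoting. It is offered as an illustration only: the name `gauss_factor`, the list-of-rows storage, the restriction to square matrices, and the exact rational arithmetic are all assumptions made here. The multipliers are kept in place beneath the diagonal, and whole rows (multipliers included) are interchanged, as in (3.22).

```python
from fractions import Fraction

def gauss_factor(A):
    """Factor a square matrix with complete pivoting; return (packed, p, q)."""
    n = len(A)
    F = [[Fraction(x) for x in row] for row in A]   # working copy of A
    p = list(range(1, n + 1))                       # column permutation vector
    q = list(range(1, n + 1))                       # row permutation vector
    for k in range(n):
        s, t = k, k                                 # locate the pivot in F[k:, k:]
        for j in range(k, n):
            for i in range(k, n):
                if abs(F[i][j]) > abs(F[s][t]):
                    s, t = i, j
        if F[s][t] == 0:
            raise ValueError("zero pivot: A is singular")
        F[k], F[s] = F[s], F[k]                     # row interchange
        q[k], q[s] = q[s], q[k]
        for row in F:                               # column interchange
            row[k], row[t] = row[t], row[k]
        p[k], p[t] = p[t], p[k]
        for i in range(k + 1, n):
            F[i][k] /= F[k][k]                      # multiplier, kept in place
            for j in range(k + 1, n):
                F[i][j] -= F[i][k] * F[k][j]        # update the Gauss transform area
    return F, p, q

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
F, p, q = gauss_factor(A)
for row in F:
    print([str(x) for x in row])
print(p, q)
```

For the example matrix the sketch returns the packed array of (3.23) together with the vectors $p$ and $q$ of (3.22). `Fraction` reduces the last entry, $-15/183$, to its lowest terms, $-5/61$.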



g. Summary 

Using the Gauss factorization process we have systematically transformed
the matrix $A \in \mathbb{R}^{4 \times 4}$ into a form that factors the original version of $A$. At
this point the factorization itself has not been discussed, only the process whereby
we claim to have factored $A$. Before we explore the resulting factorization, let us
consider, in a general way, what happens in any stage, $k$, of GF.

3. One Stage of Gauss Factorization 

The most important part of GF is the factorization that it produces.
The GF process is reversible (pivots and other key information become part of the
factorization). This section, using block matrix notation and induction on the stage
number, illustrates the effect of one stage of GF. The proof shows that we can
perform an $n$-step Gauss factorization $A = LR$, with $L$ unit lower triangular and $R$
right (upper) triangular with nonzero diagonal elements. Before the proof, however,
let us consider a concrete illustration where $n = 15$.

Let 0 denote those elements that Gauss has fixed in both value and position. 
The x symbol marks elements that are subject to permutations but not changes in 
value. Those elements that are subject to both permutation and changes in value 
are indicated by the 0 symbol. Elements in the pivot row are marked with the G 
symbol and the symbol 0 denotes elements beneath the pivot. White space indicates 
zeros, $\alpha$ is the pivot, and any $\rho_i$ was a former pivot (in stage $i$). Let $k = 7$. Then
the leftmost 7 columns of $R_7$ are already fixed in upper triangular form and $L_7$ is
unit lower triangular with the special form described above. Upon entering stage
$(k + 1) = 8$ of the Gauss factorization process, the matrices $L_7$ and $R_7$ would appear
as shown below:



[Equations (3.24) and (3.25): $15 \times 15$ diagrams of $L_7$ and $R_7$, drawn with the symbols defined above.]



With this illustration in mind, let us prove the effect of GF. 



Proposition: Given $A \in \mathbb{R}^{n \times n}$. Let $L_i \in \mathbb{R}^{n \times n}$ be the unit lower triangular matrix
with $I_{n-i}$ (the $(n-i) \times (n-i)$ identity) as its lower, right-hand block. Let $R_i \in \mathbb{R}^{n \times n}$
be the matrix that is upper right triangular in its leftmost $i$ columns. Initially, let
$A = L_0 R_0$ with $L_0 = I$ and $R_0 = A$. Let $P(k)$ be the proposition: “Stage $k$ of the
Gauss factorization process yields the factorization, $A = L_k R_k$.”

To Show: $P(k) \Rightarrow P(k+1)$ for $0 \le k \le (n-1)$.

Assumptions: Pivoting, according to any valid strategy, is performed outside of
this factorization procedure and the pivoting strategy yields pivots, $\alpha \ne 0$.

Notation: We can partition $A$ so that

\[
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
\tag{3.26}
\]

where $\alpha \in \mathbb{R}$ is the initial pivot, $x \in \mathbb{R}^{n-1}$ holds the values beneath the pivot,
$y \in \mathbb{R}^{n-1}$ holds the values of the elements in the pivot row to the right of the pivot,
and $G \in \mathbb{R}^{(n-1) \times (n-1)}$ is the Gauss transform area.

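Basis for Induction: We must show that $P(0) \Rightarrow P(1)$. $P(0)$ means that $L_0 = I_n$
and $R_0 = A$. That is, $R_0$ has no special structure except (by assumption) we are
guaranteed a nonzero pivot $\alpha$. Consider stage $k = 1$ of Gauss factorization. Let us
partition $A$ as above and factor

\[
A = \begin{bmatrix} \alpha & y^T \\ x & G \end{bmatrix}
= \begin{bmatrix} 1 & 0^T \\ \ell & I \end{bmatrix}
\begin{bmatrix} \rho & r^T \\ 0 & B \end{bmatrix}
= L_1 R_1
\tag{3.27}
\]

where $B$, $\ell$, $r$, and $\rho$ (with the obvious sizes) are defined as

\[ \rho = \alpha \tag{3.28} \]
\[ r = y \tag{3.29} \]
\[ \ell = \frac{1}{\rho}\, x \tag{3.30} \]
\[ B = G - \ell r^T \tag{3.31} \]

Thus, given $A = L_0 R_0$, Gauss factors $A = L_1 R_1$ and $P(0) \Rightarrow P(1)$.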

Inductive Step: Consider the matrices $L_k$ and $R_k$ that are submitted to stage
$(k+1)$ of a Gauss factorization procedure. We make the inductive step to show that
$P(k) \Rightarrow P(k+1)$. For $0 \le k < n$, $A = L_k R_k$ may be partitioned so that

\[
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0^T \\ N & 0 & I \end{bmatrix}
\begin{bmatrix} R & \ast & \ast \\ 0^T & \alpha & y^T \\ 0 & x & G \end{bmatrix}
= L_k R_k
\tag{3.32}
\]

where $L \in \mathbb{R}^{k \times k}$ is a unit lower triangular matrix and $R \in \mathbb{R}^{k \times k}$ is a right (upper)
triangular matrix with nonzero diagonal elements.

The Gauss process forms $\rho$ as in (3.28), $r$ as in (3.29), multipliers, $\ell$, as
in (3.30), and $B$ as in (3.31). Then, for $0 \le k \le (n-1)$, GF forms

\[
A = \begin{bmatrix} L & 0 & 0 \\ m^T & 1 & 0^T \\ N & \ell & I \end{bmatrix}
\begin{bmatrix} R & \ast & \ast \\ 0^T & \rho & r^T \\ 0 & 0 & B \end{bmatrix}
= L_{k+1} R_{k+1}.
\tag{3.33}
\]

Thus, for $0 \le k < n$, $P(k) \Rightarrow P(k+1)$. [Ref. 24]



Conclusion: The nonsingular matrix $A \in \mathbb{R}^{n \times n}$ can be factored, in $n$ steps of the
Gauss factorization process, so that $A = LR$ with $L$ being unit lower triangular and
$R$ being upper triangular with nonzero diagonal elements.

The proof has demonstrated the effect of GF. For simplicity, it excluded
the pivoting strategy (simply assuming that, at every stage, a pivot $\alpha \ne 0$ would be
available). It also held $A$ square. In this sense the proof is somewhat specific. There
is a more general conclusion to be made. This conclusion holds for GF with pivoting
and $0 \ne A \in \mathbb{R}^{m \times n}$ and it is absolutely essential to understanding the factorization.
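
The block formulas (3.28) through (3.31) can be exercised numerically on the first stage of the running example. The sketch below is an illustration under stated assumptions: the names `one_stage` and `matmul` are ours, and the input is the matrix after the stage-one interchanges, so that the pivot $\alpha = 9$ already sits in the upper left corner.

```python
from fractions import Fraction

def one_stage(A):
    """Return (L1, R1) with rho = alpha, r = y, l = x / rho, B = G - l r^T."""
    n = len(A)
    rho = Fraction(A[0][0])                               # (3.28)
    if rho == 0:
        raise ValueError("pivot must be nonzero")
    r = [Fraction(a) for a in A[0][1:]]                   # (3.29)
    ell = [Fraction(A[i][0]) / rho for i in range(1, n)]  # (3.30)
    B = [[A[i][j] - ell[i - 1] * r[j - 1] for j in range(1, n)]
         for i in range(1, n)]                            # (3.31)
    L1 = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(1, n):
        L1[i][0] = ell[i - 1]
    R1 = [[rho] + r] + [[Fraction(0)] + B[i] for i in range(n - 1)]
    return L1, R1

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[9, 4, 7, 2], [5, 6, 8, 4], [5, 3, 4, 2], [9, 8, 8, 6]]
L1, R1 = one_stage(A)
print(matmul(L1, R1) == A)
```

The product $L_1 R_1$ reproduces $A$ exactly, and the block $B$ agrees with the Gauss transform area shown in (3.19).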

4. The LR Theorem 



With the GF process complete, and the vast majority of the work done, 
we show how to form a solution from our factorization. Various methods of pivoting 
(resulting in permutation vectors) and the method whereby A is factored have been 
discussed. To solve the system, we must put all of this information together. The 
key is the LR Theorem [Ref. 24]: 

Theorem 3.1 (LR Theorem) Let $0 \ne A \in \mathbb{R}^{m \times n}$. Then there are permutation
matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, an integer $r \ge 1$, a lower trapezoidal matrix
$L \in \mathbb{R}^{m \times r}$ and an upper (right) trapezoidal matrix $R \in \mathbb{R}^{r \times n}$ so that $Q^T A P = LR$.
The diagonal elements of $L$ satisfy $\lambda_{ii} = 1$ with $i = 1, 2, \ldots, r$ and the diagonal
elements of $R$ satisfy $\rho_{ii} \ne 0$ for $i = 1, 2, \ldots, r$.

5. Filling in the Blanks



a. The Main Factors 



GF used the space of $A$ to hold the two principal matrices, $L$ and $R$,
in the factorization of $A$. To see them, we will extract the lower triangular matrix,
$L$, and upper (right) triangular matrix, $R$, from the final copy of $A$ (3.23). Initially,
let $L = R = 0$. We form $L$ by placing ones on its diagonal and filling the elements
below the diagonal from the corresponding locations in $A$.

\[
L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
5/9 & 1 & 0 & 0 \\
1 & 9/37 & 1 & 0 \\
5/9 & 1/37 & 15/61 & 1
\end{bmatrix}
\tag{3.34}
\]

$R$ is formed with the diagonal elements (i.e., pivots) and upper triangle of $A$.

\[
R = \begin{bmatrix}
9 & 7 & 2 & 4 \\
0 & 37/9 & 26/9 & 34/9 \\
0 & 0 & 122/37 & 114/37 \\
0 & 0 & 0 & -15/183
\end{bmatrix}
\tag{3.35}
\]



b. Permutation Matrices 



The bookkeeping allows us to construct P and Q very quickly. To form
$P \in \mathbb{R}^{n \times n}$, we set every column, j, in P equal to the axis vector implied by $\pi_j$, the
jth element of p. This yields the permutation matrix, P, that will satisfy the LR
Theorem, namely

\[
\begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \\ \pi_4 \end{bmatrix}
= \begin{bmatrix} 4 \\ 3 \\ 1 \\ 2 \end{bmatrix}
\;\Rightarrow\;
P = \begin{bmatrix} e_4 & e_3 & e_1 & e_2 \end{bmatrix}
= \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix} \tag{3.36}
\]

Similarly, every column, j, in $Q \in \mathbb{R}^{m \times m}$ is set equal to the axis vector implied by
$\psi_j$, the jth element of q. For our example, we have

\[
\begin{bmatrix} \psi_1 \\ \psi_2 \\ \psi_3 \\ \psi_4 \end{bmatrix}
= \begin{bmatrix} 3 \\ 2 \\ 4 \\ 1 \end{bmatrix}
\;\Rightarrow\;
Q = \begin{bmatrix} e_3 & e_2 & e_4 & e_1 \end{bmatrix}
= \begin{bmatrix}
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix} \tag{3.37}
\]

c. Check



Now we check to make sure that our solution satisfies the LR Theorem.
First, consider the product LR:

\[
LR = \begin{bmatrix}
1 & 0 & 0 & 0 \\
5/9 & 1 & 0 & 0 \\
1 & 9/37 & 1 & 0 \\
5/9 & 1/37 & 15/61 & 1
\end{bmatrix}
\begin{bmatrix}
9 & 7 & 2 & 4 \\
0 & 37/9 & 26/9 & 34/9 \\
0 & 0 & 122/37 & 114/37 \\
0 & 0 & 0 & -15/183
\end{bmatrix} \tag{3.38}
\]

\[
= \begin{bmatrix}
9 & 7 & 2 & 4 \\
5 & 8 & 4 & 6 \\
9 & 8 & 6 & 8 \\
5 & 4 & 2 & 3
\end{bmatrix} \tag{3.39}
\]

And

\[
Q^T A P = \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
2 & 3 & 4 & 5 \\
4 & 6 & 8 & 5 \\
2 & 4 & 7 & 9 \\
6 & 8 & 8 & 9
\end{bmatrix}
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix} \tag{3.40}
\]

\[
= (Q^T A) P = \begin{bmatrix}
2 & 4 & 7 & 9 \\
4 & 6 & 8 & 5 \\
6 & 8 & 8 & 9 \\
2 & 3 & 4 & 5
\end{bmatrix}
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix} \tag{3.41}
\]

\[
= \begin{bmatrix}
9 & 7 & 2 & 4 \\
5 & 8 & 4 & 6 \\
9 & 8 & 6 & 8 \\
5 & 4 & 2 & 3
\end{bmatrix} \tag{3.42}
\]

Our factorization satisfies $Q^T A P = LR$.
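The check above can also be repeated mechanically. The following is a minimal Python sketch, not part of the thesis code in Appendix F, that rebuilds P and Q from the permutation vectors and compares $Q^T A P$ with $LR$ in exact rational arithmetic. The helper names (`perm_matrix`, `transpose`, `matmul`) are illustrative assumptions, and the matrix A is the one read from the product in the check.

```python
from fractions import Fraction as F

def perm_matrix(v):
    """Build the matrix whose column j is the axis vector e_{v[j]} (v is 1-based)."""
    n = len(v)
    M = [[0] * n for _ in range(n)]
    for j, vj in enumerate(v):
        M[vj - 1][j] = 1
    return M

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[2, 3, 4, 5],
     [4, 6, 8, 5],
     [2, 4, 7, 9],
     [6, 8, 8, 9]]
L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]
R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]
P = perm_matrix([4, 3, 1, 2])   # p from (3.36)
Q = perm_matrix([3, 2, 4, 1])   # q from (3.37)

lhs = matmul(matmul(transpose(Q), A), P)   # Q^T A P
rhs = matmul(L, R)                         # L R
print(lhs == rhs)
```

Exact fractions are used here only so that the comparison is not disturbed by rounding; a floating-point version would compare within a tolerance instead.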



d. Solution 

Now we solve the system. Recall that Gaussian elimination operated
on the matrix, A, and the right-hand side, b, at the same time. The end result of
GE is that A is reduced to upper triangular form by successive elimination of the
lower triangle so that we could solve for u with a relatively easy back substitution.

The strategy of Gauss factorization is different. First, b is not part of
the factorization process. Secondly, even though we are changing A, we know that
we can get it back at the end (if we want to), so there is no need to save the original
A. Now, using the LR Theorem, we complete the solution. Recall that the original
system was

\[ Au = b. \tag{3.43} \]

The factorization process constructs permutation matrices P and Q and transforms
the original matrix A into a combined version of L and R. Further (by the LR
Theorem) we know that these matrices satisfy

\[ Q^T A P = LR. \tag{3.44} \]

Now, by multiplying (3.44) through by Q from the left and $P^T$ on the right, we see
that

\[ Q Q^T A P P^T = Q L R P^T. \tag{3.45} \]

Performing the cancellations on the left-hand side (permutation matrices are
orthogonal, so $QQ^T = I$ and $PP^T = I$), we have

\[ A = Q L R P^T. \tag{3.46} \]

This is the factorization of A. Substituting this into (3.43) yields

\[ Q L R P^T u = b \tag{3.47} \]

or

\[ L R P^T u = Q^T b. \tag{3.48} \]

Now let $\tilde{b} = Q^T b$ and let $\tilde{u} = P^T u$. Then

\[ L R \tilde{u} = \tilde{b}. \tag{3.49} \]

Further, let $R\tilde{u} = c$ for some unknown vector, c. We have

\[ L c = \tilde{b}. \tag{3.50} \]






Since we know L and $\tilde{b}$, we may solve for c by a simple forward substitution. Then,
using c and knowing that $R\tilde{u} = c$, we perform a simple back substitution and determine
$\tilde{u}$. Finally, by definition, $\tilde{u} = P^T u$ (i.e., $\tilde{u}$ is a mere permutation of u) so we
can swap elements in $\tilde{u}$ to arrive at u using $P\tilde{u} = u$.

Let us summarize this lengthy process into the main steps. The GF
process factors $A = QLRP^T$, changing the general matrix into a product where the
most significant factors are both triangular. This reduces the hard problem to two
easy ones. It is designed so that we can solve for u in two steps:

• Solve, by forward substitution, the system $Lc = \tilde{b}$ for a vector, c, of unknowns.

• Solve, by back substitution, the system $R\tilde{u} = c$ for $\tilde{u}$, a permutation of the
original unknowns, u.

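So, for our example, the first step is to solve

\[
\begin{bmatrix}
1 & 0 & 0 & 0 \\
5/9 & 1 & 0 & 0 \\
1 & 9/37 & 1 & 0 \\
5/9 & 1/37 & 15/61 & 1
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
= Q^T b
= \begin{bmatrix} 13 \\ -5 \\ -17 \\ 0 \end{bmatrix}
= \tilde{b}. \tag{3.51}
\]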



Forward substitution, applied to this system, yields

\[
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
= \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix}. \tag{3.52}
\]

Now we know c, so we can solve the second triangular system, $R\tilde{u} = c$, for $\tilde{u}$ by back
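substitution

\[
\begin{bmatrix}
9 & 7 & 2 & 4 \\
0 & 37/9 & 26/9 & 34/9 \\
0 & 0 & 122/37 & 114/37 \\
0 & 0 & 0 & -15/183
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 13 \\ -110/9 \\ -1000/37 \\ -15/61 \end{bmatrix} \tag{3.53}
\]

which yields

\[
\tilde{u} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \\ -11 \\ 3 \end{bmatrix}. \tag{3.54}
\]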



Now it is easy to recover u. Since we have defined $\tilde{u} = P^T u$, we know
that $P\tilde{u} = u$ (a simple rearrangement of the elements that we have already found).
We apply P to $\tilde{u}$ and find that

\[
u = P\tilde{u} = \begin{bmatrix} -11 \\ 3 \\ 2 \\ 1 \end{bmatrix}. \tag{3.55}
\]

Comparing this to earlier solutions, we find that GF has arrived at the same solution.

In these examples, the notion of elimination was developed first. The
GE process performs successive eliminations beneath its pivots and reduces A to
triangular form, and then the solution is available in only $n^2$ flops. GF spends
an almost identical amount of work in the reduction process, but the result is a
factorization with L and R being the significant factors. (They are the only ones
that are more than a permutation of the identity.) In the examples, we used pivoting
because it was practical. Now let us take a closer look at the justifications for
pivoting.
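The two triangular solves and the final permutation from the worked example can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis code: the function names are hypothetical, the permutations are applied through the vectors p and q instead of the full matrices P and Q, and exact fractions stand in for floating-point arithmetic.

```python
from fractions import Fraction as F

def forward_substitution(L, b):
    """Solve L c = b for a unit lower triangular L."""
    c = []
    for i, row in enumerate(L):
        c.append(b[i] - sum(row[j] * c[j] for j in range(i)))
    return c

def back_substitution(R, c):
    """Solve R u = c for an upper triangular R with nonzero diagonal."""
    n = len(R)
    u = [0] * n
    for i in reversed(range(n)):
        s = sum(R[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (c[i] - s) / R[i][i]
    return u

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
L = [[1, 0, 0, 0],
     [F(5, 9), 1, 0, 0],
     [1, F(9, 37), 1, 0],
     [F(5, 9), F(1, 37), F(15, 61), 1]]
R = [[9, 7, 2, 4],
     [0, F(37, 9), F(26, 9), F(34, 9)],
     [0, 0, F(122, 37), F(114, 37)],
     [0, 0, 0, F(-15, 183)]]
p = [4, 3, 1, 2]              # column permutation vector (1-based)
q = [3, 2, 4, 1]              # row permutation vector (1-based)
b_tilde = [13, -5, -17, 0]    # Q^T b, as given in (3.51)

c = forward_substitution(L, b_tilde)    # L c = b_tilde
u_tilde = back_substitution(R, c)       # R u_tilde = c

# u = P u_tilde: element j of u_tilde lands in position p[j] of u.
u = [0] * len(p)
for j, pj in enumerate(p):
    u[pj - 1] = u_tilde[j]

# b = Q b_tilde, used only to confirm that A u = b.
b = [0] * len(q)
for j, qj in enumerate(q):
    b[qj - 1] = b_tilde[j]
residual = [sum(a * x for a, x in zip(row, u)) - bi for row, bi in zip(A, b)]
print(u, residual)
```

Working with the permutation vectors directly avoids forming P and Q at all, which is the usual choice when only the solution is needed.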






F. PIVOTING FOR SIZE 



The issue of pivoting is a very interesting and important one. We concluded that
we must pivot or face the possibility of attempting to divide by zero, an unacceptable
option. To solve this problem, we may pick any nonzero element in A(k : m, k : n)
and perform the column and row interchanges required to install it as the new pivot
(k is the pivot index). There are many strategies that we could adopt.

The logical question would be something like: “Given that we must pivot, what
is the best means available?” But the answer is not so easy, and there are many
trade-offs to be considered. We are faced with choosing along a spectrum, where
speed lies at one end and accuracy lies at the other. For instance, we could begin a
search and pick the first nonzero element in this area. Or, we could search for the
row with the most nonzero elements (that had a nonzero element in the kth column).

The two most common strategies for pivoting are the partial and complete meth- 
ods, which we have discussed. We determined that partial pivoting would work per- 
fectly (with no error) if A was nonsingular and the storage and arithmetic could be 
handled with infinite precision. If infinite precision were available, we could stop 
right here. There would be no need to try to refine the method. In a finite-precision 
machine, however, we must deal with the issue of errors. 

To deal with errors, the problem must be stated more precisely. The errors
that concern us would arise due to growth of the elements of L and/or R as we step
through the stages of Gauss. In the end, partial pivoting guarantees that all of the
elements of L will be, at most, unity in absolute value. This is easy to see. The pivoting
strategy chooses each pivot to be the largest element (in absolute value) in column k at or
below row k. This value is installed at A(k, k) and everything below the pivot is
divided by the pivot.

Unfortunately, partial pivoting cannot make the same guarantee for the elements
of R. It helps: the multipliers are less than or equal to one in absolute value.
The elements of R are bounded by $2^{n-1}\alpha$, where $\alpha$ is the largest absolute value of
the elements in A. This bound is not normally attained “in practice” [Ref. 23].

Growth is an indicator of trouble in this process. If we cannot control it completely,
we should, at a minimum, monitor it. The growth factor, g(n), of a Gauss
factorization process for $A \in \mathbb{R}^{n \times n}$ is defined as follows. Let $\alpha$ be the largest absolute
value in the original matrix, A. Let $\beta$ be the largest absolute value that occurs in
any Gauss transform, G, including the first one, G = A. Then $g(n) = \beta/\alpha$ gives a
growth factor normalized by $\alpha$ (i.e., $g(n) \geq 1$).
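To make the definition concrete, the sketch below monitors the growth factor during elimination with partial pivoting. It is a minimal illustration, not the thesis implementation: the function names are hypothetical, exact fractions are used so that the result is not blurred by rounding, and the test matrix is a standard worst-case example from the numerical analysis literature (ones on the diagonal and in the last column, minus ones below the diagonal) rather than a matrix from this chapter.

```python
from fractions import Fraction

def growth_factor(A):
    """Return the largest |entry| seen in any Gauss transform divided by the
    largest |entry| of A, using elimination with partial pivoting."""
    G = [[Fraction(x) for x in row] for row in A]
    n = len(G)
    alpha = max(abs(x) for row in G for x in row)
    beta = alpha
    for k in range(n - 1):
        # Partial pivoting: first row at or below k with the largest |G[i][k]|.
        s = max(range(k, n), key=lambda i: abs(G[i][k]))
        if G[s][k] == 0:
            raise ValueError("A is singular")
        G[k], G[s] = G[s], G[k]
        for i in range(k + 1, n):
            mult = G[i][k] / G[k][k]
            G[i][k] = Fraction(0)
            for j in range(k + 1, n):
                G[i][j] -= mult * G[k][j]
                beta = max(beta, abs(G[i][j]))
    return beta / alpha

def worst_case(n):
    """Matrix for which partial pivoting attains growth 2^(n-1)."""
    return [[1 if (i == j or j == n - 1) else (-1 if i > j else 0)
             for j in range(n)] for i in range(n)]

for n in (2, 3, 4, 5):
    print(n, growth_factor(worst_case(n)))
```

For this family the last column doubles at every stage, so the bound $2^{n-1}\alpha$ is met exactly; matrices arising in practice rarely behave this way.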

A great deal of analysis has been done on this subject. Wilkinson showed
that, with complete pivoting and real matrices, g(n) grows much more slowly than
$2^n$. He conjectured that $g(n) \leq n$. The latter has recently been disproved, with a
counterexample by Nicholas Gould [Ref. 23].

As a practical matter, when one seeks to monitor growth one uses complete 
pivoting. To consider performance, one uses the partial pivoting strategy. The 
growth factor, g(n), is easy to monitor with a complete pivoting strategy since we are 
moving through the entire Gauss transform area at each stage anyway. For clarity, 
the pivoting algorithms and the Update algorithm are listed separately in this 
chapter. In real code (e.g., Appendix F), however, the pivot for stage (k + 1) should 
be located during the update of G in stage k (to avoid unnecessary passes through 
the matrix). This would mean extra work in the partial pivoting algorithm. Since 
the primary reason for using partial pivoting is performance, it is counterproductive 
to monitor g(n) while using partial pivoting. A description of both pivoting policies, 
in algorithm form, follows.

Algorithm 3.1 (Partial Column Pivoting for Size) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$; a permutation vector, $q \in \mathbb{R}^m$; and an index, k, indicating the
pivot column, this algorithm performs partial pivoting. First, the pivot element is
located at A(s,k) with $s \geq k$. Once the pivot has been located, rows s and k are
swapped to install the new pivot. Additionally, elements in q, indexed by s and k,
are swapped to record the row interchanges.

begin PP
    s = k;
    for i = (k + 1) : m
        if (|A(i,k)| > |A(s,k)|)
            s = i;
        end if
    end for
    if (s ≠ k)
        for j = 1 : n
            x = A(k,j); A(k,j) = A(s,j); A(s,j) = x;
        end for
        i = q(k); q(k) = q(s); q(s) = i;
    end if
end PP

Algorithm 3.2 (Complete Pivoting for Size) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$; permutation vectors, $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$; and an index, k, indicating the
pivot row and column, this algorithm performs complete pivoting. First, the pivot
element is located at A(s,t). Once the pivot has been located, rows s and k and
columns t and k are swapped to install the new pivot. The permutation vectors are
updated accordingly.

begin PC
    s = k;
    t = k;
    for i = k : m                                   (locate the pivot)
        for j = k : n
            if (|A(i,j)| > |A(s,t)|)
                s = i;
                t = j;
            end if
        end for
    end for
    if (s ≠ k)                                      (row interchanges)
        for j = 1 : n
            x = A(k,j); A(k,j) = A(s,j); A(s,j) = x;
        end for
        i = q(k); q(k) = q(s); q(s) = i;
    end if
    if (t ≠ k)                                      (column interchanges)
        for i = 1 : m
            x = A(i,k); A(i,k) = A(i,t); A(i,t) = x;
        end for
        i = p(k); p(k) = p(t); p(t) = i;
    end if
end PC

G. SEQUENTIAL ALGORITHMS



The examples considered have described the Gauss process. We first considered
elimination (GE) and then a factorization method (GF). Both methods require work
of the same order, so the latter, yielding a factorization of A, is much preferred.
Algorithms for the GF process are described below. The arithmetic in the Gauss
transform area, G, is performed the same (regardless of pivoting strategy) so a
separate algorithm is given for updating G. The algorithms GFPP (pivoting, partial)
and GFPC (pivoting, complete) are given following the updating algorithm. These
algorithms are adapted from Gragg [Ref. 23].
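Before the pseudocode listings, it may help to see how the pieces fit together in a runnable form. The following Python sketch mirrors the complete-pivoting variant (pivot search, interchanges, and update of the Gauss transform area). It is an illustration under assumptions, not the C code of Appendix F: indices are 0-based, the function names are hypothetical, the main loop stops at min(m, n), and exact fractions replace floating-point arithmetic so the result can be compared with the worked example.

```python
from fractions import Fraction

def pivot_complete(A, p, q, k):
    """Move the largest |entry| of A[k:, k:] to (k, k); record swaps in p and q."""
    m, n = len(A), len(A[0])
    s, t = k, k
    for i in range(k, m):
        for j in range(k, n):
            if abs(A[i][j]) > abs(A[s][t]):
                s, t = i, j
    if s != k:                      # row interchange (whole rows, multipliers included)
        A[k], A[s] = A[s], A[k]
        q[k], q[s] = q[s], q[k]
    if t != k:                      # column interchange
        for row in A:
            row[k], row[t] = row[t], row[k]
        p[k], p[t] = p[t], p[k]

def update(A, k):
    """Store multipliers in column k and update the Gauss transform area."""
    m, n = len(A), len(A[0])
    pivot = A[k][k]
    for i in range(k + 1, m):
        A[i][k] = A[i][k] / pivot   # multiplier, kept in place of the eliminated entry
        mult = A[i][k]
        for j in range(k + 1, n):
            A[i][j] -= mult * A[k][j]

def gfpc(A):
    """Return (packed L and R, p, q) with Q^T A P = L R; p and q are 0-based."""
    A = [[Fraction(x) for x in row] for row in A]
    m, n = len(A), len(A[0])
    p, q = list(range(n)), list(range(m))
    for k in range(min(m, n)):
        pivot_complete(A, p, q, k)
        if A[k][k] == 0:
            raise ValueError("A is singular")
        update(A, k)
    return A, p, q

A = [[2, 3, 4, 5], [4, 6, 8, 5], [2, 4, 7, 9], [6, 8, 8, 9]]
packed, p, q = gfpc(A)
print([j + 1 for j in p], [i + 1 for i in q])   # 1-based, as in the text
```

Applied to the example matrix, this reproduces the permutation vectors and the packed L and R of the worked example, with L below the diagonal (unit diagonal implied) and R on and above it.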



Algorithm 3.3 (Update Gauss Transform Area) Given the matrix of coefficients,
$A \in \mathbb{R}^{m \times n}$; and k, the pivot column, this algorithm performs the appropriate
arithmetic throughout the pivot column and Gauss transform area, G, of A.

begin Update
    x = A(k,k);                                     (x is the pivot value)
    for i = (k + 1) : m                             (pivot column division)
        A(i,k) = A(i,k)/x;
    end for
    for i = (k + 1) : m                             (arithmetic in G)
        x = A(i,k);                                 (now x is the multiplier)
        for j = (k + 1) : n
            A(i,j) = A(i,j) - x × A(k,j);
        end for
    end for
end Update

Algorithm 3.4 (Gauss Factorization with Partial Pivoting) Given the matrix
of coefficients, $A \in \mathbb{R}^{n \times n}$, this algorithm modifies (overwrites) A with a unit lower
triangular matrix (with an implicit diagonal), $L \in \mathbb{R}^{n \times n}$, and an upper (right) triangular
matrix, $R \in \mathbb{R}^{n \times n}$, having nonzero diagonal elements (the pivots). The process
also forms the row permutation vector, q, and the corresponding permutation matrix,
$Q \in \mathbb{R}^{n \times n}$, that results from partial column pivoting for size. The algorithm gives
the factorization: $Q^T A = LR$.

begin GFPP
    n = order(A)
    Q = zeros(n, n)
    for j = 1 : n                                   (initialize q)
        q(j) = j;
    end for
    for k = 1 : n                                   (the Gauss process)
        PP(A, q, k)                                 (pivoting)
        if (A(k,k) = 0)
            print “A is singular!”
            exit
        end if
        Update(A, k)                                (Update G)
    end for
    for j = 1 : n
        Q(q(j), j) = 1.0;
    end for
end GFPP

Algorithm 3.5 (Gauss Factorization with Complete Pivoting) Given a matrix
of coefficients, $A \in \mathbb{R}^{m \times n}$, the following algorithm modifies (overwrites) A with
a unit lower trapezoidal matrix (with implicit diagonal), $L \in \mathbb{R}^{m \times n}$, and an upper
(right) trapezoidal matrix, $R \in \mathbb{R}^{m \times n}$. The diagonal elements of R are nonzero (pivots).
The process forms permutation matrices, $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{m \times m}$, to reflect
the complete pivoting for size. These matrices are formed to satisfy the LR Theorem:
$Q^T A P = LR$.

begin GFPC
    m = rows(A); n = cols(A);                       (initialization)
    P = zeros(n, n); Q = zeros(m, m);
    for j = 1 : n
        p(j) = j;
    end for
    for i = 1 : m
        q(i) = i;
    end for
    for k = 1 : n                                   (the Gauss process)
        PC(A, p, q, k)                              (pivoting)
        if (A(k,k) = 0)
            print “A is singular!”
            exit
        end if
        Update(A, k)                                (Update G)
    end for
    for j = 1 : n                                   (form P)
        P(p(j), j) = 1.0;
    end for
    for j = 1 : m                                   (form Q)
        Q(q(j), j) = 1.0;
    end for
end GFPC

H. CONJUGATE GRADIENTS



Time permits only a brief synopsis of the method of conjugate gradients (CG). 
This method was described by Magnus R. Hestenes and Eduard Stiefel [Ref. 18]. 
CG possesses some very nice characteristics and it is quite different from the Gauss 
method. Once again, we begin with a system of linear equations 



\[ Au = b \tag{3.56} \]

The algorithm given by Hestenes and Stiefel is designed for $A \in \mathbb{R}^{n \times n}$ symmetric
and positive definite (Appendix A). Let $s \in \mathbb{R}^n$ be the vector that would solve (3.56)
exactly, so that As = b. Let $u_i \in \mathbb{R}^n$ be the estimate of the solution, s, produced
in the ith iteration. The original estimate, $u_0$, is merely a guess (it may be a good
guess). For instance, in the absence of better information, we could choose $u_0$ to be
the vector of all zeros or all ones.

The CG process takes our initial guess and develops a (guaranteed) better
estimate for the next stage. To measure the progress, we could use the residual
vector

\[ r_i = b - A u_i \tag{3.57} \]

but Hestenes and Stiefel warn that its Euclidean norm, $\|r_i\|_2$, may actually increase
in every step but the last! A more reliable measure, called the error vector

\[ e_i = s - u_i \tag{3.58} \]

has monotonically decreasing length. After n iterations of the CG process, we are
guaranteed to have a very good estimate $u_n$ of s. In fact, if no rounding errors
occur, we have $u_n = s$. In practice, CG can find a very good estimate, $u_m$, of s
in m iterations, with $m \ll n$. The process “terminates in at most n steps if no
rounding-off errors are encountered.” [Ref. 18: p. 410]

The algorithm below is adapted from Hestenes and Stiefel [Ref. 18]. Before
considering the algorithm, however, we should define the key term, conjugate. For
A symmetric, two vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$ are said to be A-orthogonal (or
conjugate) if the relation $x^T A y = (Ax)^T y = 0$ holds [Ref. 18: p. 410]. This is
an extension of vector orthogonality, $x^T y = 0$. The algorithm given below is very
simple. The iteration blindly proceeds from i = 0 to i = n. A more sophisticated
(finite precision) scheme would set a tolerance (notion of “good enough”) and stop
(exit the loop) when this criterion was satisfied.
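The tolerance-based variant mentioned above can be sketched as follows. This is a minimal Python illustration, not the thesis algorithm verbatim: the function names and the default tolerance are assumptions, the stopping test uses the residual norm (one possible notion of “good enough”), and the loop runs at most n times. The step lengths use the forms $\alpha_i = p_i^T r_i / \delta$ and $\beta_i = -r_{i+1}^T A p_i / \delta$ with $\delta = p_i^T A p_i$.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradients(A, b, tol=1e-12):
    """Solve A u = b for symmetric positive definite A, starting from u = 0."""
    n = len(b)
    u = [0.0] * n
    r = [bi - ai for bi, ai in zip(b, matvec(A, u))]   # residual r_0 = b - A u_0
    p = list(r)                                        # first direction p_0 = r_0
    for _ in range(n):
        if dot(r, r) ** 0.5 <= tol:                    # good enough: stop early
            break
        Ap = matvec(A, p)
        delta = dot(p, Ap)
        alpha = dot(p, r) / delta
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = -dot(r, Ap) / delta
        p = [ri + beta * pi for ri, pi in zip(r, p)]   # next A-orthogonal direction
    return u

# Small symmetric positive definite example (chosen for illustration only).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
u = conjugate_gradients(A, b)
print(u)
```

Checking the residual before each step also guards against dividing by a zero $\delta$ once the exact solution has been reached.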



Algorithm 3.6 (The Method of Conjugate Gradients) Given the symmetric,
positive definite matrix of coefficients, $A \in \mathbb{R}^{n \times n}$, and an initial guess, $u_0$, for the
solution, s, of the system Au = b, this algorithm (in the absence of rounding-off
errors) finds $u_i = s$ in i iterations ($i \leq n$). The algorithm keeps track of a residual
vector, $r_i$, and direction vectors, $p_i$. The residuals, $r_i$, are mutually orthogonal and
the direction vectors, $p_i$, are mutually conjugate (A-orthogonal).

begin CG
    u_0 = zeros(n)                                  (arbitrary initial guess)
    p_0 = r_0 = b - A u_0
    for i = 0 : n
        δ = p_i^T A p_i                             (denominator used below)
        α_i = (p_i^T r_i)/δ                         (scalar multiplier used below)
        u_{i+1} = u_i + α_i p_i                     (estimate of solution)
        r_{i+1} = r_i - α_i A p_i                   (residual vector)
        β_i = -(r_{i+1}^T A p_i)/δ
        p_{i+1} = r_{i+1} + β_i p_i                 (direction vector)
    end for
end CG

I. SUMMARY



This chapter develops the Gaussian elimination process, the Gauss factoriza- 
tion process, pivoting strategies, and (briefly) the method of conjugate gradients. 
Each of the corresponding algorithms possesses potential for parallel solution. A 
parallel implementation of GF appears in the following chapter. Both partial and 
complete pivoting are pursued, with further discussion on their implications in a 
parallel environment.

IV. PARALLEL DESIGN



Nature is pleased with simplicity, and affects not the pomp of superfluous 

causes. 

— SIR ISAAC NEWTON (1642-1727) 

Sequential algorithms for Gauss factorization (GF) and the method of conjugate 
gradients (CG) are established in Chapter III. The goal of this chapter is to show 
parallel algorithms for Gauss factorization. The C programs that implement these 
algorithms are discussed in Chapter V and listed in Appendix F. 

Parallel algorithm design is a process that includes many considerations. The 
question of how to achieve parallelism is largely an art and is not discussed here. 
The method used in this research is often called a workfarm approach because the 
algorithm farms out work to processors. Equivalently, it may be called a manager-worker
model. When we distribute the problem across many processors in a workfarm
style, there are quite a number of issues that warrant careful consideration. The 
concerns associated with programming a parallel machine — even with a relatively 
simple model such as this — could occupy volumes. 

Communications, load balancing, granularity, and other considerations abound. 
Metrics like speedup and efficiency should be used to lend credibility to the parallel 
nature of the algorithm. Additionally, we should consider the usual issues of main- 
tainability, readability, portability, and other traits commonly associated with good 
(sequential) programming practice. Parallel codes must be clear combinations of 
sequential codes that are joined together in a logical manner. Simplicity should hold 
a place of great esteem in a parallel algorithm. The rest of this chapter introduces 
the issues of parallel design, particularly as they pertain to Gauss factorization.

A. INTERPROCESSOR COMMUNICATIONS



Interprocessor communication is one of the most fundamental issues in parallel 
processing and, quite possibly, the most involved. Without a means of communicat- 
ing (in a message-passing environment), the multiprocessor system is meaningless. 
The implications of any communications scheme are many and the interactions can 
be quite complex. Exhaustive coverage of this issue is out of the question, so we will 
consider a few of the most essential ideas. 

1. The Network 

A network is the part of a multiprocessor system’s hardware that bears 
the interprocessor communications burden. It is a combination of nodes and links 
that connect those nodes, and it is the foundation upon which all communications
must build. We will also refer to the nodes of a multiprocessor, using somewhat
loose terminology, as processors. The term node is a more general term. Nodes
are typically more sophisticated than a simple central processing unit (CPU) or, for 
that matter, any other sort of processor. The link is a wire that connects two nodes. 
An interconnection topology describes the pattern of links used to connect the nodes 
of a network. The network can be drawn or illustrated so that we can see how its 
nodes are connected. Appendix C discusses interconnection topologies and it gives 
a description (and illustrations) of the particular scheme used in this research: the 
hypercube. 

Intel combines an 80386 CPU with an 80387 math coprocessor and communications
facilities to form a “CX” node for the iPSC/2 that was used in this research.
INMOS provides the same general capabilities but packages it all on a (very sophisticated)
single chip, called a transputer. Figure 4.1, from INMOS’ T9000 Transputer
Products Overview Manual [Ref. 25: p. 31], shows a high-level block diagram of the
components of a T9000 transputer. Thus, any node of a message-passing multiprocessor
system can be thought of as a combination of computing and communications
facilities. It may possess other capabilities as well.

2. Message Routing 

The machines used in this research exhibit different message transmission 
schemes. The transputer system employs high-speed (20 megabits per second) point- 
to-point serial communications and store-and-forward message passing. That is, for 
multi-hop communications, each node along the way must receive the message, store
it in local memory temporarily, and then pass it to the next node in the route. 

The Intel iPSC/2 uses another technique, called circuit switching or direct- 
connect communications. This approach is much like our telephone system. First, 
the originator of the message sends a small message containing information about 
the message (e.g., destination node number, length of message) to the destination 
via the nodes in-between. As this small header packet makes its way to the destination,
the nodes along the way flip switches, closing a circuit from the sender to the
receiver. Once this circuit is established, the message proceeds from the sender to 
the destination without interruption. 

Each method has its advantages and disadvantages. The circuit switching
approach allows for fewer interruptions along the way, but it ties up the entire path
for the duration of the communication. The store-and-forward method imposes
delays for storing the message into, and then retrieving it from, the memory of every
node along the way. (A more complete description of these two techniques, together
with experimental results, is given in Appendix B.) For the algorithms employed in
this research, almost all communications were “nearest neighbor” in the hypercube.
In this case, the difference between the two approaches to message routing is
insignificant and the nearest neighbor performance becomes more important.

Figure 4.1: IMS T9000 Block Diagram

3. Concurrent Computing and Communicating



The nodes of a multiprocessor machine should be able to both compute 
and communicate efficiently and concurrently. This is no small undertaking. The 
computing side must access memory to accomplish its mission, but the message- 
passing begins by drawing data out of memory and ends by storing data into mem- 
ory. Therefore, at a minimum, we have competition related to memory accesses. 
Furthermore, the computing and communication must be synchronized to some extent.
The algorithms in this research used blocking communications (described
in Appendix E), which enforce synchronization.

There are overheads associated with communications and this synchroniza- 
tion problem. Bryant showed how transputers perform under various communica- 
tion loads [Ref. 26] and this is mentioned in Appendix E. The issue of overheads 
is one that Charles Seitz considered for the “Cosmic Cube.” Much, but not all, of 
the overhead is communication-related. Seitz listed three of the major problems 
[Ref. 27: p. 28]: 

(1) the idle time that results from imperfect load balancing, (2) the wait- 
ing time caused by communications latencies in the channels and in the message 
forwarding, and (3) the processor time dedicated to processing and forwarding mes- 
sages, a consideration that can be effectively eliminated by architectural improve- 
ments in the nodes. 

Included in these costs, we should also recognize that some amount of time is required 
for the processor to perform “context switching” (changing jobs) and/or coordination 
with a special-purpose processor that we might call the communications manager. 

Although the issue of concurrent communication and computing is a very
complex one, we may consider significant issues that are related to the efficiency of
communications and the effect upon the processor. Geoffrey Fox presents the notion
of comparing communications ability to processing ability [Ref. 28: pp. 50-51]. Let
$t_{calc}$ be “the typical time required to perform a generic calculation. For scientific
problems, this can be taken as a floating-point calculation a = b × c, or a = b + c.”
Furthermore, let $t_{comm}$ be “the typical time taken to communicate a single word
between two nodes connected in the hardware topology.” Then the ratio

\[ \frac{t_{comm}}{t_{calc}} \]

is a general characteristic of a particular system that can be quite useful in comparing
machines. Fox uses this ratio in much of the rest of his work.

A parallel machine must necessarily possess a capable communications subsystem,
but this is not enough. The program should also make prudent use of the
communications facilities. This means that the programmer and/or compiler must
exhibit a good understanding of the machine’s communications abilities and weaknesses.
Some characteristics are nearly universal. Most machines, for instance,
reward the use of long messages because there is an overhead (nearly independent
of message length in many cases) to sending any message. Other characteristics are
very much machine-dependent. This means that the programmer should be relatively
familiar with the communications abilities and characteristics of the target
machine.
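The reward for long messages can be illustrated with a simple linear cost model: a fixed startup overhead plus a per-word transfer time. The model, the function names, and the numbers below are assumptions chosen for illustration; they are not measurements from the iPSC/2 or the transputers.

```python
def message_time(words, startup, per_word):
    """Time to send one message of `words` words under a linear cost model."""
    return startup + words * per_word

def total_time(words, messages, startup, per_word):
    """Time to send `words` words split evenly across `messages` messages."""
    return messages * message_time(words // messages, startup, per_word)

# Hypothetical parameters (arbitrary time units).
STARTUP = 100
PER_WORD = 2

one_long = total_time(1000, 1, STARTUP, PER_WORD)       # a single 1000-word message
many_short = total_time(1000, 1000, STARTUP, PER_WORD)  # 1000 one-word messages
print(one_long, many_short)
```

Under this model the startup term is paid once per message, so batching data into fewer, longer messages reduces the total cost whenever the startup overhead is not negligible.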

4. Accessing the Clock 

The ability to accurately measure the time required by communications
and computations, preferably at the host and every node in the system, is absolutely
essential in a multiprocessor environment. Profiling, in a sequential program, allows
us to compare the time required by various parts of a program. Timing in a parallel
environment allows us to profile the code. Thus we can determine the time required for
instructions, loops, functions, or communications.

Profiling is an even more important practice for parallel coding than it is in
the sequential case. The only way for a parallel program to be useful is if it
can be implemented efficiently upon an acceptable number of processors. That is,
in general, the only object in choosing a multiprocessor system over a sequential
machine is the speed with which computation can be performed. One of the best
tools available to the parallel programmer is the ability to see where and how much
time is being spent.

At a minimum, we need the ability to sample a clock with reasonable precision.
Both machines and compilers used in this research provide this capability (see
timing.h in Appendix F for details). The transputers offer a choice of frequencies:
the clock associated with low priority processes has a period of 64 microseconds and 
the high priority clock offers one microsecond ticks. The iPSC/2 mclock() function 
gives time in milliseconds. 
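Because the clocks report time in different units, elapsed times must be converted to a common unit before they can be compared. The sketch below shows one way to do this; the function and constant names are hypothetical and are not taken from timing.h, and clock wraparound is ignored.

```python
# Tick periods in microseconds, as quoted in the text.
TRANSPUTER_LOW_PRIORITY_US = 64
TRANSPUTER_HIGH_PRIORITY_US = 1
IPSC2_MCLOCK_US = 1000          # mclock() reports milliseconds

def elapsed_seconds(start_ticks, stop_ticks, tick_period_us):
    """Convert two clock samples into elapsed seconds."""
    return (stop_ticks - start_ticks) * tick_period_us / 1_000_000

# The same 16 ms interval as seen by each clock.
low = elapsed_seconds(100, 350, TRANSPUTER_LOW_PRIORITY_US)
high = elapsed_seconds(0, 16_000, TRANSPUTER_HIGH_PRIORITY_US)
host = elapsed_seconds(40, 56, IPSC2_MCLOCK_US)
print(low, high, host)
```

The coarser the tick, the longer a code section must run before its measured time is meaningful, which is one reason to time loops rather than single instructions.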

B. METRICS FOR PARALLEL COMPUTING 
1. Complexity 

Perhaps the most obvious measures for a parallel algorithm are simply 
those that we use for sequential algorithms. We want to keep time and storage 
requirements to a minimum. Perhaps the major difference in complexity analysis 
for a parallel algorithm is that we are primarily interested in a per-processor notion 
of complexity. If the problem has been farmed out in a fair manner, complexity 
analysis for the parallel case is merely an extension of the sequential case. 

Consider the matrix $A \in \mathbb{R}^{n \times n}$. Suppose that its elements are 8-byte,
double-precision, floating-point values (type double in C). Let $M_p$ denote the
memory (in bytes) required at each processor to store A on p processors and let $T_p$
denote the time required for p processors to solve the system characterized by A.
Then $M_1 = 8n^2$ bytes of storage, but (ideally) $M_8 = n^2$. When the problem is
distributed across p processors simultaneously, the processors can share the storage burden.
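As a rough illustration of how per-processor storage limits the problem size, the sketch below evaluates the ideal case in which A is divided evenly and nothing else occupies local memory. Both are simplifying assumptions (program code, workspace, and message buffers are ignored), and the function names are hypothetical.

```python
import math

def bytes_per_processor(n, p, bytes_per_element=8):
    """Ideal storage per processor for an n-by-n matrix spread over p processors."""
    return bytes_per_element * n * n / p

def largest_order(p, local_bytes, bytes_per_element=8):
    """Largest n whose evenly divided matrix fits in local_bytes per processor."""
    return math.isqrt(local_bytes * p // bytes_per_element)

print(bytes_per_processor(100, 1))      # M_1 for n = 100
print(bytes_per_processor(100, 8))      # M_8 for n = 100
print(largest_order(8, 32 * 1024))      # 8 nodes with 32 kilobytes each
```

In practice the usable order is smaller than this bound, since each node must also hold its program and communication buffers.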



Exceptions abound. For certain problems, it may actually be convenient 
(faster or more reliable) to store the entire matrix at each processor. Nevertheless, 
in most cases we would like to minimize local memory requirements. The Gauss 
factorization algorithm considered near the end of this chapter is no exception. In- 
deed, the transputers used in this work had only 32 kilobytes of storage each and 
the results of Chapter VI for transputers show how this can dictate the size of the 
problem that can be executed. The concepts of time and storage complexity have 
been developed in detail for sequential algorithms and they seem to hold a place in 
parallel algorithm assessment as well. We consider other measures that have been 
developed for parallel computing in the following section. 

2. Contemporary Measures 

The concepts of speedup and efficiency (Appendix A) are two of the most
common performance measures currently associated with parallel computing, with
the ideal case (100% efficiency) yielding $t_P = t_1/P$ on a P-processor system. Selim
Akl proposes the following criteria for analyzing algorithms [Ref. 29: pp. 21-28]:

• Running Time: Running time t(n) is the time required to execute an al- 
gorithm for a problem of input size n. Akl lists three ways to express this 
notion. First, we may count the steps in an algorithm. Akl distinguishes be- 
tween computational steps (i.e., something like flops) and routing steps that 
are associated with interprocessor communication. Second, we have lower and 
upper bounds (e.g., the complexity notation presented in Appendix A). Fi- 
nally, we have speedup. Akl gives the usual definition of speedup but clarifies 
it somewhat (details below). 

• Number of Processors: Second in importance, Akl considers the number of 
processors required by an algorithm. He uses p(n) to denote the number of 
processors required for a problem of size n. 

• Cost: Akl defines the cost, c(n), of a parallel algorithm as the product of the 
first two factors. That is, c(n) = t(n) × p(n). 

• Other Measures: In this category, we have no fewer than three other qualities 
of a parallel system that deserve consideration. The area (i.e., chip real estate) 
required by the processors is significant. The length of the links figures in, as 
do any patterns (regularity and modularity). And finally, the period 
between processing different elements of an input is important. 

Apparently metrics for parallel computing are still developing. There are several 
very useful concepts such as speedup and efficiency. The definition of speedup, at 
first glance, is rather standard. It doesn’t take much probing, however, to find that 
different authors make different assumptions. Akl defines speedup S in the usual 
manner, 

$$S = \frac{t_1}{t_p} \qquad (4.1)$$

except that he is somewhat more specific about the times. He defines $t_1$ as the 
“worst-case running time of fastest known sequential algorithm for problem” and $t_p$ 
as “worst-case running time of parallel algorithm.” [Ref. 29: p. 24] He has been 
more specific than most authors, but it seems likely that the algorithms, method of 
obtaining times $t_1$ and $t_p$, and systems should also be specified. Speedup is defined 
loosely in most cases. A parameterization to accompany speedup would be tedious, 
but useful. Until speedup becomes a standard term with accepted meaning, we shall 
have to specify exactly what it means. We should be more careful with this term. 
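
These measures are simple enough to state in C. The sketch below assumes the usual definition of efficiency as speedup divided by the number of processors; the function names are illustrative only. Note that the caller must decide what $t_1$ means. Akl would use the fastest known sequential algorithm, whereas a measured table typically uses the same program on one node.

```c
/* Speedup S = t1 / tp, as in equation (4.1). */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency: speedup divided by the number of processors p. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / (double)p;
}

/* Akl's cost c(n) = t(n) * p(n). */
double cost(double t, int p)
{
    return t * (double)p;
}
```

As an example, the n = 176 row of Table 6.1 gives 723.596 seconds on one node and 395.008 seconds on eight, so speedup(723.596, 395.008) is a little over 1.8 and the efficiency is under 25%.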

3. Other Ideas 



Akl has appropriately distinguished between computational steps and routing 
steps. The term floating-point operations (flops) has become quite popular (along 
with benchmarks) and this is a useful means of expressing the computational ability 
of a machine (for floating-point applications). The notion of routing, however, is 
somewhat vague. Nevertheless, this idea must be addressed. It should probably 
become more specific as we talk about similar machines. 

The machines used for this research were MIMD message-passing systems. 
We can get much more specific about “routing steps” for such a machine. First, using 
the clock as a stopwatch, we can profile any segment of code (including calculations 
and/or communications). An implementation-specific version of Fox’s $t_{comm}/t_{calc}$ 
ratio can be instructive. It is important to apply this ratio to the hardware as Fox 
defines it, but it is equally important to recognize the role of the software (algorithm). 
That is, for some specific implementation, we should be interested in finding some 
measure of how much time is spent communicating and how much time is spent 
computing. More specifically, a careful profile could be made of a program in the 
following manner. 

The ratio of cumulative (i.e., over the execution of the entire program) time 
spent communicating to time spent computing should be considered as a first cut, 
especially if performance (efficiency) is weak. Algorithms such as Gauss factorization 
are executed in stages, within a loop of some sort. In this case, the $t_{comm}/t_{calc}$ 
ratio per iteration is an interesting figure (and, if the loop represents most of the 
program’s execution time, this should be approximately equal to the cumulative 
figure). 
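
A profile of this kind needs very little machinery. The following sketch is not the thesis code; it assumes the standard C clock() function as the stopwatch, and the names profile, profile_add, and comm_calc_ratio are hypothetical. On a real message-passing machine a wall-clock timer from the vendor library would probably replace clock(), since clock() reports processor time and may not count time spent blocked on a link.

```c
#include <time.h>

/* Accumulated time, in clock ticks, for the two kinds of work. */
struct profile {
    clock_t comm;   /* ticks spent in communication calls */
    clock_t calc;   /* ticks spent computing */
};

/* Add the ticks elapsed since `start` to one accumulator. */
void profile_add(clock_t *bucket, clock_t start)
{
    *bucket += clock() - start;
}

/* Ratio t_comm / t_calc; returns -1.0 if no compute time was recorded. */
double comm_calc_ratio(const struct profile *p)
{
    if (p->calc == 0)
        return -1.0;
    return (double)p->comm / (double)p->calc;
}
```

Inside the main loop, one would record t0 = clock() before a communication call and then call profile_add(&prof.comm, t0) after it, and likewise for the arithmetic. Resetting the accumulators at the top of each iteration gives the per-stage ratio instead of the cumulative one.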

When possible, communication complexity should be analyzed carefully. For 
instance, in the Gauss factorization code that is presented in 
Appendix F, a C structure is used to relay the owner (node id) of a pivot and the 
pivot’s row, column, and value. This structure is 20 bytes of data and we know 
the pattern with which these structures are moved about during the course of the 
program. It is important to quantify communication like this when possible. The 
vague notation should lose significance in the presence of such concrete information. 
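
To make the 20-byte figure concrete, here is one plausible layout for such a structure. The field names and types are assumptions for illustration; the actual declaration is in Appendix F. With 4-byte integers and 8-byte doubles the useful data comes to 20 bytes, although a modern compiler may pad the structure for alignment, so sizeof can report more.

```c
#include <stddef.h>

/* Hypothetical layout of the pivot message (names are illustrative). */
struct pivot_msg {
    int    owner;   /* node id that holds the pivot */
    int    row;     /* pivot row index */
    int    col;     /* pivot column index */
    double value;   /* pivot value */
};

/* Bytes of useful data in one message, ignoring compiler padding. */
size_t pivot_payload_bytes(void)
{
    return 3 * sizeof(int) + sizeof(double);
}
```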

There are other important and related ideas. The frequency and volume 
of communications traffic is easy to determine with a high degree of accuracy for 
algorithms such as Gauss factorization. Once again, in the presence of this kind 
of information, we should dispense with vague concepts. It is useful to consider 
something like a pie chart showing the various amounts of time spent on each portion 
of the major loop in a program. Indeed, this was a part of the development of the 
Gauss code given in this thesis. Tools such as these are important in refining parallel 
algorithms and streamlining code. 

The parallel program designer must consider many other issues regarding 
communications. Graph theory notation is a natural tool. A link-by-link analysis 
of the communications over the course of a program is not out of the question (espe- 
cially if the communication is merely a repetition of very simple messages). Efficient 
use of the topology is important. We should consider the percentage of links used, 
balancing of the communications load, frequency of traffic for each link (often the 
communication comes in bursts and often between iterations of the basic algorithm), 
flow rate (in bytes per second) for each link during the bursts or over longer periods 
of time, timelines showing dependencies, and other specific characteristics of commu- 
nications. Analysis should be done on a per-stage basis for algorithms that exhibit 
iteration (loops). 

Perhaps most importantly, a plan for interprocessor communication should 
begin well in advance, before the code is ever written. A reactive approach is 
necessary, like debugging code. But a proactive, strong design effort can simplify matters. 
The notion of communicating sequential processes (CSP) deserves attention. This 
model is due to C. A. R. Hoare [Ref. 30], and it is never far away in the world of 
transputers. There is a very close relationship between transputers, Occam (their native 
language), and CSP. CSP is a useful paradigm for this sort of (message-passing) 
machine. When possible, a problem should be logically separated into processes. 
The division of the problem should be natural, so that every process represents a 
logical group of tasks. The processes are allowed channels to communicate, and these 
channels are implemented as either links in hardware or buffers in memory if, for 
instance, two processes on the same processor wanted to communicate. 

If a problem is designed correctly, we should have substantial amounts of 
work within a process and minimal interprocess communication. If the processes and 
channels are represented as the nodes and edges of a directed graph, we can make 
use of some nice tools and theorems from graph theory. For instance, we should like 
to maximize computation and minimize communications. One natural method is to 
begin with atomic processes and start to build. 

Suppose that we have many such processes (at least as many as processors) 
and we represent them as the nodes of a directed graph. We can assign the processes 
(nodes) a weight that reflects some form of computational difficulty. This should be 
a fairly concrete number, assuming that the task (process) is well-defined. It might 
be the number of flops per iteration, for example. Next, the channels should be 
clearly indicated as weighted, directed edges. The weight should usually be a very 
concrete number as well, like the number of bytes that passes along that channel 
between each stage of a computation. 

This model gives the problem the sort of order that is necessary to keep 
the parallel design simple, logical, and formal (i.e., friendly for proof of program 
correctness). Once the problem has been expressed in such a manner, there are 
many options. For example, we could consider minimum cuts of the flow rates to 
decide how to efficiently apportion processes to processors. This mapping alone could 
greatly enhance the performance of code. 
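
The weighted graph model lends itself to a small data structure. The sketch below does not compute a minimum cut; it only evaluates a candidate mapping by summing the channel weights that cross processor boundaries and the node weights placed on each processor. All names here are hypothetical.

```c
#include <stddef.h>

/* One channel between two processes; the weight is the number of
   bytes that pass along it per stage of the computation. */
struct channel {
    int  from;
    int  to;
    long bytes;
};

/* Bytes per stage that must travel over hardware links, i.e. the
   weight of the edges cut by the mapping. map[i] is the processor
   assigned to process i. */
long cut_bytes(const struct channel *ch, size_t nch, const int *map)
{
    long total = 0;
    size_t i;
    for (i = 0; i < nch; i++)
        if (map[ch[i].from] != map[ch[i].to])
            total += ch[i].bytes;
    return total;
}

/* Computational load (e.g., flops per stage) placed on one processor. */
long load_on(const long *flops, const int *map, size_t nproc, int processor)
{
    long total = 0;
    size_t i;
    for (i = 0; i < nproc; i++)
        if (map[i] == processor)
            total += flops[i];
    return total;
}
```

Comparing cut_bytes for several mappings with similar loads gives a first indication of which assignment keeps the heavy channels inside a processor.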

It seems that much of the work in this area is rather imprecise and generally 
unacceptable. Granted, parallel design methodology is a relatively recent problem 
but it can be improved substantially. Good parallel designs that consider these kinds 
of issues and express them clearly will likely be in high demand as parallel computing 
machinery develops. 

C. PARALLEL METHODS 



The wide-ranging capabilities of contemporary computing machinery are evi- 
dent. An exhaustive list would demand pages, but most readers could readily name 
several applications that bear little resemblance to each other. For a single, very spe- 
cific machine there is almost no limit to the combinations of sequential instructions 
that it may carry out. Put another way, a particular machine can be designed and 
built in a few months or years depending upon the level of sophistication involved. 
But the different types and purposes of software that may be created to run on that 
single machine are nearly limitless. Consider Householder's comments on the art of 
computation [Ref. 17: p. 1]: 

If a computation requires more than a very few operations, there are usually 
many different possible routines for achieving the same end result. Even so simple 
a computation as ab/c can be done (ab)/c, (a/c)b, or a(b/c), not to mention the 
possibility of reversing the order of the factors in the multiplication. Mathematically 
these are all equivalent; computationally they are not (cf. § 1.2 and § 1.4). 
Various, and sometimes conflicting, criteria must be applied in the final selection 
of a particular routine. If the routine must be given to someone else, or to a com- 
puting machine, it is desirable to have a routine in which the steps are easily laid 
out, and this is a serious and important consideration in the use of sequenced com- 
puting machines. Naturally one would like the routine to be as short as possible, 
to be self-checking as far as possible, to give results that are at least as accurate as 
may be required. And with reference to the last point, one would like the routine to 
be such that it is possible to assert with confidence (better yet, with certainty) and 
in advance that the results will be as accurate as may be desired, or if an advance 
assessment is out of the question, as it often is, one would hope that it can be made 
at least upon completion of the computation. 

— ALSTON S. HOUSEHOLDER 

Parallel algorithms are combinations of sequential ones, so their complexity 
can grow quickly. In general, the hardware issues surrounding parallel problems 
are mature and straightforward. Software, on the other hand, is developing and 
generally difficult to use. 

In addition to the familiar design considerations for a straightforward sequential 
algorithm, the design of a parallel solution must specify: 

• An awareness of the interaction between processing and communication. Fre- 
quency and duration (message length) of communications should be known, if 
possible. Additionally, we should know how this compares to the frequency 
and duration (flops) of computing work. 

• A plan for interprocessor communication, including hardware and software. 

• A scheme for memory usage. 

• The granularity of the problem (i.e., should the processors be given larger or 
smaller “chunks” of work at a time). 

• Load balancing among several processors. 

• A method for accessing input/output resources. 

This is a very high level look at the problem. The issue of communications alone 
can be more than half of the problem. The simplicity of this short list does not do 
the problem justice. Correct execution, as in the sequential case, is very important. 
But parallel algorithms are subject to the added scrutiny of performance data (e.g., 
efficiency). 

The methodology for constructing parallel algorithms is a very creative process, 
and there are many questions that can be asked. Is a highly efficient parallel solution 
possible, or is the problem bound by dependencies and sequential work? What is 
the ratio of time spent communicating to time spent computing? How nearly does 
a given algorithm approach the optimal solution? What would happen on some 
other number of processors? Are there any bottlenecks that can be eliminated? 
Nevertheless, the current performance of parallel machines and the promise of 
future architectures are more than adequate motivation to continue developing these 
products. 

D. ALGORITHMS 

With the preceding concerns in mind, let us consider the algorithm for Gauss 
factorization that was used in this work. The algorithm is given at a very high 
level because detail can be gleaned from Chapter V and from the actual code in Ap- 
pendix F. The first consideration for GF was “How should the work be distributed?” 
There are many options. The matrix could be distributed by rows, or columns, or 
blocks. The method chosen in this case was a distribution of the columns of A across 
the nodes of the machine. The columns were distributed so that column j went to 
processor number j (mod P) in a P-processor network. 

Such a distribution scheme seems natural for several reasons. First, the work 
associated with the Gauss process moves toward the lower right-hand corner of the 
matrix $A \in \Re^{n \times n}$. By using a modulus assignment, and assuming that $n \gg P$, we 
have a situation where the load on the processors is nearly balanced for most of the 
process. Second, a column-oriented assignment places the pivot column on a single 
node at each stage. This makes division by the pivot value a simple task. It is 
interesting to note that a similar distribution of A by rows would have merit as well. 
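
The balance claim is easy to check numerically. The following sketch, with hypothetical function names, counts how many columns each node still has in play at stage r under the modulus assignment (here a column is counted if its index is at least r).

```c
/* Node that owns column j under the cyclic (modulus) assignment. */
int owner_of_column(int j, int P)
{
    return j % P;
}

/* Number of columns with index in [r, n) owned by `node`, i.e. the
   columns of that node that still take part in the work at stage r. */
int active_columns(int node, int r, int n, int P)
{
    int count = 0, j;
    for (j = r; j < n; j++)
        if (owner_of_column(j, P) == node)
            count++;
    return count;
}
```

With n = 16 and P = 4, every node starts with four columns. At stage r = 5 the counts are 2, 3, 3, and 3, so no two nodes differ by more than one column as the active region shrinks.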

Once the matrix has been distributed, the code simply moves, in a synchronized 
fashion, from stage to stage of Gauss. At each stage, we must pivot according to 
some strategy. Complete pivoting showed especially poor performance since it 
involved a great deal of communication and synchronization between stages. The 
partial pivoting method allows us to determine which node will have the pivot and 
much less communication is required when this node simply broadcasts the pivot and 
pivot column. After the pivot node divides every element under the pivot by the 
pivot value, it broadcasts the entire pivot column to every other processor. When the 
processors obtain the pivot column, they use the multipliers to perform arithmetic 
in the Gauss transform area, and then proceed to the next stage. 

The following algorithms give an overview of the programs that appear in Ap- 
pendix F. 

Algorithm 4.1 (Parallel GF: Host) At this level, the host code is essentially the 
same for both partial pivoting and complete pivoting. The program is very simple: 
distribute the columns, and then accept them back one-by-one. Let $A \in \Re^{m \times n}$ be 
the matrix of coefficients, and let P be the number of processors. This algorithm 
forms the modified copy of A by overwriting the original copy. After the nth column 
is returned from the nodes, we have the factored version of A that can be separated 
into L and R in the usual manner. 

begin GF (Host) 
    for j = 0 : (n - 1) 
        send A(:,j) to node (j mod P) 
    end for 

    for r = 0 : (n - 1) 
        receive A(:,r) from node (r mod P) 
    end for 
end GF (Host) 

Algorithm 4.2 (Parallel GFPP: Nodes) Let $A \in \Re^{m \times n}$ be the entire matrix 
(held at the host). This algorithm is executed on each node in a P-processor network. 
Let the node number be N and let $A_N \in \Re^{m_N \times n}$ be the local copy of select columns 
of the matrix A (where $m_N \approx m/P$ is the number of columns held locally). Let $G_N$ 
be that part of the Gauss transform area, G, that is held locally. This node receives 
every column, j, of A where (j mod P) = N. 

begin GFPP (Nodes) 
    for j = 0 : (m_N - 1) 
        receive column and place in A_N(:,j) 
    end for 

    for r = 0 : (n - 1) 
        if (r mod P) = N                         (pivot is held locally) 
            perform partial pivoting 
            broadcast pivot row index, s, to all nodes 
            perform pivot column arithmetic 
            broadcast pivot column to all nodes 
        else 
            receive pivot row index, s, and perform row interchanges 
            receive broadcast of pivot column 
        end if 

        if N = 0 
            send pivot column to host 
        end if 

        perform arithmetic in G_N 
    end for 
end GFPP (Nodes) 
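
The arithmetic of Algorithm 4.2 can be followed in a sequential stand-in. The sketch below is not the Appendix F code: it runs on one processor, stores the matrix by columns so that each a[j] plays the role of a column held on node j mod P, and replaces each broadcast with a plain loop over all columns. The size N, the function name gfpp, and the helper abs_d are assumptions for illustration. Returning columns to the host is not modeled.

```c
#define N 3   /* illustrative problem size */

static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* a[j][i] holds A(i,j), so a[j] is column j. On return, a holds the
   multipliers below the diagonal and the upper triangular factor on
   and above it; perm[r] is the pivot row chosen at stage r.
   Returns 0 on success, -1 if a zero pivot is met. */
int gfpp(double a[N][N], int perm[N])
{
    int r, i, j, s;
    double tmp;

    for (r = 0; r < N; r++) {
        /* Owner of column r: find the largest entry at or below the diagonal. */
        s = r;
        for (i = r + 1; i < N; i++)
            if (abs_d(a[r][i]) > abs_d(a[r][s]))
                s = i;
        perm[r] = s;
        if (a[r][s] == 0.0)
            return -1;

        /* "Broadcast" s: every column interchanges rows r and s. */
        for (j = 0; j < N; j++) {
            tmp = a[j][r];
            a[j][r] = a[j][s];
            a[j][s] = tmp;
        }

        /* Pivot column arithmetic: divide by the pivot to form multipliers. */
        for (i = r + 1; i < N; i++)
            a[r][i] /= a[r][r];

        /* "Broadcast" the pivot column: each later column is updated
           by whichever node owns it. */
        for (j = r + 1; j < N; j++)
            for (i = r + 1; i < N; i++)
                a[j][i] -= a[r][i] * a[j][r];
    }
    return 0;
}
```

In the parallel version the two inner loops over j are exactly the work that the cyclic column assignment spreads across the nodes.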

Algorithm 4.3 (Parallel GFPC: Nodes) Let $A \in \Re^{m \times n}$ be the entire matrix 
(held at the host). This algorithm is executed on each node in a P-processor network. 
Let the node number be N and let $A_N \in \Re^{m_N \times n}$ be the local copy of select columns 
of the matrix A (where $m_N \approx m/P$ is the number of columns held locally). Let $G_N$ 
be that part of the Gauss transform area, G, that is held locally. This node receives 
every column, j, of A where (j mod P) = N. 

begin GFPC (Nodes) 
    for j = 0 : (m_N - 1) 
        receive column and place in A_N(:,j) 
    end for 

    for r = 0 : (n - 1) 
        locate best (local) pivot candidate 
        elect pivot (let node N_p hold the winner of the pivot election) 
        if (N_p = N) 
            broadcast pivot indexes, (s,t), to all nodes 
            perform pivot column arithmetic 
            broadcast pivot column to all nodes 
        else 
            receive pivot indexes, (s,t) 
            perform permutations 
            receive broadcast of pivot column 
        end if 

        if N = 0 
            send pivot column to host 
        end if 

        perform arithmetic in G_N 
    end for 
end GFPC (Nodes) 

V. IMPLEMENTATION 



A. ENVIRONMENT 

Chapter IV introduces parallel algorithms for Gauss factorization (GF). The 
GF algorithms are produced for partial and complete pivoting strategies. All of 
the programs associated with this research are written in parallel versions of the C 
language and executed on two types of machines at the U. S. Naval Postgraduate 
School. The Math Department’s iPSC/2 afforded eight of Intel’s CX type processors 
arranged in a hypercube topology. The Parallel Command and Decision Systems 
(PARCDS) Laboratory in the Computer Science Department has more than seventy 
transputers available for the experiments. The discussion below gives a more exact 
description of the material and equipment used in the work. 

1. Hardware 

This section describes the machines upon which the work was carried out. 
A general knowledge is assumed, including familiarity with the Intel 80386 
microprocessor, 80387 math coprocessor, and INMOS transputers. Some of this information 
is provided in Appendix B. 

The hardware used in this research represents the state-of-the-art for the 
mid-to-late 1980s. These machines are quickly becoming outdated (fitting the 
history of computing), but both INMOS and Intel have more recent, competitive 
products in today’s market and fine prospects for future machines. So, while they are 
a bit dated, the products used in this research represent important contemporary 
parallel architectures. 

Figure 5.1: Hypercube Interconnection Topology: Order n ≤ 3 

a. Networks of Transputers 

The majority of the research was performed upon hypercubes of order 
$n \in \{0, 1, 2, 3\}$. These are the usual hypercubes (see Appendix C) and each is 
imbedded in the 3-cube. Figure 5.1 shows this topology. Some of the transputer 
work for this thesis was performed by a network of sixteen IMS T800-20 transputers 
connected in nearly hypercube fashion (Figure 5.2). This is not identical to the 
4-cube, so it will be called the hybrid cube (it is used as a root with two subtrees that 
happen to be 3-cubes). The subtrees of the hybrid cube can be distinguished by the 
first bit. One of the 3-cubes has labels like 0xxx; the other is labeled 1xxx. 

Figure 5.2: Hybrid Hypercube Interconnection Topology 

The rationale behind building the hybrid cube is purely practical. The 
transputers have only four links. Assuming that we define a node of the hypercube to 
be a single transputer, a pure hypercube of order four would be a closed interconnection 
scheme with no opportunity for input or output to or from the system. Here, 
the root node has been inserted between nodes zero (0000) and eight (1000). While 
this deals a horrible blow to the elegance of hypercube algorithms (particularly 
communications), it can be used effectively. 

The hardware for the hybrid hypercube is configured with code by Mike 
Esposito [Ref. 31]. This gives us sort of an unlabeled version of the structure that 
appears in Figure 5.2. To make use of this configuration, the nodes must be labeled 
in a logical fashion. The Gray code (Appendix C) is a reasonable choice for labeling 
the nodes. The actual labeling is accomplished by a Network Information File (NIF) 
when the transputers are loaded by the Logical Systems C Network Loader, 
LD-NET. A more detailed description of this process is contained in the file named 
hyprcube.nif in Appendix F. 
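
The appeal of a Gray code for labeling is that consecutive codes differ in exactly one bit, so consecutive positions land on neighboring nodes of a hypercube. The sketch below assumes the standard reflected binary Gray code; the labeling actually used is defined by the NIF in Appendix F and may differ in detail. The function names are illustrative.

```c
/* Reflected binary Gray code of i. */
unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}

/* Number of bit positions in which a and b differ. */
int hamming(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int count = 0;
    while (x) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}
```

For a 3-cube the sequence gray(0), gray(1), ..., gray(7) is 0, 1, 3, 2, 6, 7, 5, 4, and each step crosses a single link.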

Networks of transputers use point-to-point communications across 
bidirectional links. The links for this work operate at 20 megabits per second 
(bidirectionally). That is, ten megabits per second is a peak unidirectional transmission 
rate. Current transputer implementations employ a store-and-forward approach to 
message passing (see Appendix B) for multi-hop transmissions. 

b. Intel iPSC/2 

The iPSC/2 used for this research contained eight processors of the 
“CX” type (80386/80387 combination). The host is an 80386-based IBM-compatible 
personal computer running AT&T UNIX System V (version 3.2). The nodes run a 
local subset of UNIX called NX. The host is capable of supporting many users at 
once, but each node supports only a single user. 

Users can request p nodes, where $p = 2^n$ for $n \in \{0, 1, 2, 3\}$. If another 
user does not already have the requested portion of the cube, the request is granted. 
As long as nodes remain, another user can access them. For instance, one user could 
be working on two nodes and, at the same time, another user could access up to 
four others. While the first two users still possessed these six nodes, a third user 
could get one or both of the remaining two nodes. 

Unlike the transputers, Intel uses a direct-connect circuit switching (see 
Appendix B) approach to multi-hop communications. There is an overhead 
associated with setting up the path for communication, but this cost is nearly the same 
regardless of how many hops the message crosses. Once the circuit is established, 
the message can proceed directly from the origin to the destination with negligible 
interference from intermediate nodes. 
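
The difference between the two approaches can be seen in a first-order timing model. This is an assumed textbook-style model, not a measurement of either machine: store-and-forward pays a startup cost and the full transfer time at every hop, while circuit switching pays one setup cost (taken here as independent of distance) and one transfer time. The function and parameter names are hypothetical.

```c
/* Store-and-forward: the whole message is received and resent at
   every hop, so both costs are paid once per hop. */
double store_and_forward_time(int hops, double bytes,
                              double startup, double bytes_per_sec)
{
    return hops * (startup + bytes / bytes_per_sec);
}

/* Circuit switching: one setup cost, then the message streams from
   origin to destination without stopping at intermediate nodes. */
double circuit_switched_time(double bytes, double setup,
                             double bytes_per_sec)
{
    return setup + bytes / bytes_per_sec;
}
```

Under this model the two schemes agree for a single hop and diverge linearly with distance, which is why multi-hop traffic deserves particular attention on the transputer network.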

c. Host and Root 

The notion of host is similar on both machines, but there is a slight 
difference. The Intel hypercube is directly connected to the host. The transputer 
network, however, uses a substantially different protocol than the typical personal 
computer. Transputers employ point-to-point serial communications, using an 
11-bit link protocol with byte-by-byte acknowledgment. The acknowledge is a two-bit 
packet with dual meaning: the receiving transputer has begun to receive the byte 
and it has storage space for another. 

In the transputer case, host means the PC. We use the term root trans- 
puter to identify the transputer within the host PC that acts something like a host 
to the attached network of transputers. Figure 5.1 illustrates this configuration. An 
IMS B004 extension board in the host PC holds a T414 root transputer. The B004 
is plugged into the PC’s bus and a parallel-serial converter lies between the PC and 
the T414. In Figure 5.1 the “host” is a PC and the “root” transputer is the T414. 
The iPSC/2 host is simplified, and could almost be thought of as a combination of 
the host and root for the transputer case. Since the entire thesis uses the same 
programs for both machines, the root and host terminology can become confusing. As it 
is not always convenient to express this difference in painstaking detail, I will use the 
terms somewhat loosely. An understanding of the differences between the machines 
should serve to eliminate confusion in every case. When only one of the terms (host 
or root) is needed, I have used the correct term. When both of the terms apply, I 
have used them almost interchangeably and they should be interpreted according to 
the machine under consideration. 

2. Software 

The software for this research was written in the C language. The Logical 
Systems C product (version 89.1 of 15 January 1990) was used for the transputer 
implementation. For the iPSC/2 work, the C compiler supplied by Intel was used. 

B. COMMUNICATIONS FUNCTIONS 

Prior to implementing the Gauss algorithms, a substantial communications 
package was constructed. Most of the code for communications appears in the files 
comm.h and comm.c (see Appendix F). As expected, the header file provides 
definitions for manifest constants and specifications (declarations) for the functions. 
An overview of the functions provided in this file is useful before we discuss the 
Gauss code that called these functions. 

The cubecast() function supports broadcasts from the host to all the nodes 
of a hypercube. Given a hypercube of order $n \in \{0, 1, 2, 3\}$ with $p = 2^n$ processors, 
this communication is completed in n, or $\log_2(p)$, stages. This has some utility 
in a 3-cube, but imagine the impact in a 10-cube. All 1,024 processors in the 
hypercube would have the message after 10 stages of communication. This function 
is especially useful at the beginning of a problem, when data must be shipped to 
each of the workers in the network. 
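
A small simulation shows why the number of stages is logarithmic. The schedule below is one common way to organize such a broadcast and is an assumption here; the real cubecast() in Appendix F may order its stages differently. At stage k, every node that already holds the message forwards it across the link that flips bit k of its label, so the number of holders doubles each stage.

```c
#define MAX_NODES 1024

/* Simulate a broadcast from node 0 on a hypercube of the given order
   (2^order <= MAX_NODES). has[i] becomes 1 once node i holds the
   message. Returns the number of stages used. */
int simulate_cubecast(int order, int has[MAX_NODES])
{
    int p = 1 << order, stage, node;

    for (node = 0; node < p; node++)
        has[node] = 0;
    has[0] = 1;

    for (stage = 0; stage < order; stage++) {
        int d = 1 << stage;
        /* Nodes with labels below d already hold the message; each
           one forwards it to its neighbor across direction d. */
        for (node = 0; node < d; node++)
            has[node ^ d] = has[node];
    }
    return stage;
}

/* Count the nodes that hold the message. */
int count_holders(int order, const int has[MAX_NODES])
{
    int p = 1 << order, node, count = 0;
    for (node = 0; node < p; node++)
        count += has[node];
    return count;
}
```

For order 10 the simulation reaches all 1,024 nodes in 10 stages, matching the count given above.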

Often we need to gather information in the reverse direction, from the workers 
back to the root. The coalesce() function is one way to accomplish this task. If no 
modification was necessary at intermediate nodes, this operation could be completed 
without interference. In the algorithms that I used, however, there was occasion to 
modify the information along the way back to the root. For this reason, the gathering 
is accomplished using two function calls. First, information is coalesced to a given 
node. Upon return from coalesce(), the data exists locally and may be operated 
upon. When the data is ready for submission, the submit() function is used to pass 
it one step closer to the root. 

A modification of the cubecast() function that was useful for the Gauss 
problem was cubecast_from(). This function does not assume that the host is the 
originator of the broadcast. Instead, the source is specified as the first argument to 
this function. The function still performs the broadcast in $\log_2(p)$ stages, but it uses 
the concept of a direction to accomplish this. 

The concept of directions in the hypercube turns out to be a fairly useful 
one. For concreteness, consider the 3-cube shown in Figure C.2. Starting at 
any given node, we can specify a direction using one of the three combinations 
$d \in \{001, 010, 100\}$. Suppose that the node’s label is $\ell$ and let $\oplus$ denote the 
exclusive OR operation. Then for some direction, d, the number $(\ell \oplus d)$ is the label of the 
node in the direction d from the node $\ell$. 

This concept can be applied in general in a hypercube of order n using n-bit 
labels for the nodes and some direction d. The possible directions are all the n 
combinations of (n - 1) zeros and a single one in an n-bit number. Accordingly, 
the code uses directions $d \in \{1, 2, 4, \ldots, 2^{n-1}\}$. In most cases, when a 
direction-by-direction approach is desired for all possible directions, we start with one and use 
the C left shift operator (<<) to produce the other directions incrementally. 
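
In C the direction idea reduces to an exclusive OR and a shift. The sketch below uses hypothetical names and is not taken from comm.c. The second function assumes one particular broadcast schedule, in which stage k uses direction 1 << k and labels are taken relative to the source; under that assumption a node receives the message at the stage given by the highest set bit of its relative label.

```c
/* Fill out[] with the labels of the neighbors of `label` in a
   hypercube of the given order; returns the number written. */
int neighbors(unsigned label, int order, unsigned out[])
{
    unsigned d;
    int k = 0;
    for (d = 1; d < (1u << order); d <<= 1)
        out[k++] = label ^ d;
    return k;
}

/* Stage at which `label` receives a broadcast started at `source`
   (assumed schedule: stage k uses direction 1 << k).
   Returns -1 for the source itself. */
int stage_received(unsigned label, unsigned source)
{
    unsigned rel = label ^ source;   /* label relative to the source */
    int stage = -1;
    while (rel) {
        stage++;
        rel >>= 1;
    }
    return stage;
}
```

For node 101 in a 3-cube, the three directions give the neighbors 100, 111, and 001.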

These functions and several others are described in detail in the code of Ap- 
pendix F, but these basic ideas give us a reasonably good introduction at a level 
that is adequate for understanding the algorithms. 

C. CODE DESCRIPTIONS 

A detailed description of the source code used to implement the algorithms of 
Chapter IV is given in the header file gf.h. This header file, located in Appendix F, is 
used by both the partial pivoting and complete pivoting codes. The code for GF with 
partial pivoting can be found in gfpphost.c, the host program, and gfppnode.c, 
the node program. The code for the complete pivoting algorithm is similar except 
for the election of pivots, so most of it has been omitted in the interest of saving 
space. Only the elect_next_pivot() function remains because it is the significant 
difference between the partial and complete pivoting codes. This function appears 
in gfpcnode.c. 

VI. RESULTS 



A. GAUSS WITH COMPLETE PIVOTING 

The host code, gfpchost.c, and the node program, gfpcnode.c, are written 
to provide a parallel implementation of Gauss Factorization with complete pivoting. 
Since the columns of A are distributed among the nodes of the multiprocessor system, 
the selection of each pivot requires communication. The selection process, in this 
case, begins with each node selecting its own best candidate for pivot. Once each 
of the nodes has made this choice, an election is held to select the best candidate 
among all of the nodes. 

Implementation details for the election process are described in the source code, 
so a detailed description is not given here. Nevertheless, these results show how 
communication, like the election process, can stand in the way of efficient parallel 
programming. This program shows how parallel performance can suffer from the effects of 
communications. (Recall Fox’s $t_{comm}/t_{calc}$ and Seitz’s three components of overhead 
from Chapter IV). 

The complete pivoting strategy inserts inefficient communications between each 
stage of the process. The communications themselves are bound to be inefficient since 
the election process finds all nodes of an n-cube participating in an n-stage exchange 
of a 20-byte structure (pivot candidates). In addition to the use of small messages, 
the election imposes an added measure of synchronization upon the problem. This 
allows the processors less independence and forces them to transition between “use- 
ful” program execution and communication more frequently. This transition can 
become burdensome and the processor can eventually find little time to perform 
calculations. 
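
The shape of the election can be illustrated with a short simulation. This is a sketch under assumptions, not the elect_next_pivot() function from Appendix F: the structure fields, the tie-breaking rule, and the function names are hypothetical. In each of the n stages every node exchanges its current candidate with its neighbor in one direction and keeps the one of larger magnitude, so after n stages all nodes agree on the winner.

```c
#define MAX_NODES 8

struct candidate {
    int    owner;   /* node that holds this candidate */
    int    row;
    int    col;
    double value;
};

static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Simulate the election on a hypercube of the given order
   (2^order <= MAX_NODES). c[i] is node i's local best on entry and
   the global winner on return. Returns the number of messages sent. */
int elect(struct candidate c[], int order)
{
    struct candidate next[MAX_NODES];
    int p = 1 << order, d, i, messages = 0;

    for (d = 1; d < p; d <<= 1) {
        for (i = 0; i < p; i++) {
            const struct candidate *mine = &c[i];
            const struct candidate *theirs = &c[i ^ d];
            double a = abs_d(mine->value), b = abs_d(theirs->value);
            /* Keep the larger magnitude; break ties by lower owner
               id so that both partners reach the same decision. */
            if (b > a || (b == a && theirs->owner < mine->owner))
                next[i] = *theirs;
            else
                next[i] = *mine;
            messages++;   /* node i sent its candidate to node i ^ d */
        }
        for (i = 0; i < p; i++)
            c[i] = next[i];
    }
    return messages;
}
```

On a 3-cube this costs 24 small messages for every pivot, each paying the full startup overhead, and it happens once per stage of the factorization.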

In addition to the election process, there is a one-to-all broadcast from the 
node holding the pivot to inform the others of the pivot column values. With an 
m × m matrix A, this message is essentially a column of m double precision 
floating-point values. Doubles for this implementation were eight bytes each, so this is a 
unidirectional broadcast of 8m bytes with exponential fanout. 

The election process, as simple as it appears, will prove to be an obstacle 
that opposes efficiency. Both the iPSC/2 and transputer systems reward, in terms 
of transmission rates, the sender of long messages. Short messages are essentially 
penalized by the overhead involved in setting up the transmission line and manager. 
Let us consider the results of this complete pivoting strategy. The results from the 
iPSC/2 appear first followed by the transputer results. The largest dimension, n, 
that is recorded is n = 176. The iPSC/2 machine would handle larger problems, but 
this seemed pointless since the performance appears to approach maximum efficiency 
early. 
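
The penalty on short messages can be made concrete with a simple linear cost model, offered here only as an assumed illustration: the time for a message is a fixed startup cost plus a per-byte cost. Neither parameter is taken from the thesis or from vendor data, and the function names are hypothetical.

```c
#include <assert.h>

/* Linear message-cost model: time = startup + bytes * per_byte.
   Both parameters are placeholders to be measured on a real system. */
double message_time(double startup, double per_byte, double bytes)
{
    assert(startup >= 0.0 && per_byte >= 0.0 && bytes >= 0.0);
    return startup + per_byte * bytes;
}

/* Fraction of the total message time spent on the fixed startup cost. */
double startup_fraction(double startup, double per_byte, double bytes)
{
    double t = message_time(startup, per_byte, bytes);
    return (t > 0.0) ? startup / t : 0.0;
}
```

Under this model a 20-byte election message is dominated by the startup term for any realistic parameters, while a 1408-byte pivot column spreads the same startup cost over far more data.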

1. Data for the iPSC/2 System 

Table 6.1 shows the timing data for execution of Gauss Factorization with 
complete pivoting on the Intel iPSC/2 system. 
TABLE 6.1: EXECUTION TIMES FOR GF(PC) ON THE iPSC/2 

Dimension          Time (seconds) on a Hypercube of Order
      (n)           0           1           2           3
---------   ---------   ---------   ---------   ---------
        8       0.126       0.097       0.092       0.155
       16       0.716       0.674       0.608       0.744
       24       2.208       1.751       1.616       1.568
       32       4.627       3.705       3.239       3.149
       40       9.246       6.888       5.895       5.250
       48      14.888      11.479       9.770       9.109
       56      23.686      17.883      15.206      13.796
       64      36.123      26.424      22.326      19.957
       72      49.227      38.178      31.421      28.460
       80      70.546      50.754      42.087      37.810
       88      89.210      69.257      56.803      51.148
       96     115.473      86.760      72.346      63.954
      104     150.915     110.247      91.966      82.680
      112     182.475     138.880     114.486     102.266
      120     224.458     168.056     139.587     123.683
      128     282.491     206.222     170.650     153.379
      136     339.076     248.422     208.745     186.205
      144     385.623     295.217     241.564     217.099
      152     468.763     345.049     281.972     254.538
      160     527.953     404.235     331.653     292.352
      168     636.004     457.089     381.597     338.464
      176     723.596     532.597     449.745     395.008

TABLE 6.2: SPEEDUPS FOR GF(PC) ON THE iPSC/2 

Dimension      Speedup on a Hypercube of Order
      (n)           1           2           3
---------   ---------   ---------   ---------
        8       1.299       1.373       0.813
       16       1.063       1.178       0.962
       24       1.261       1.367       1.408
       32       1.249       1.429       1.470
       40       1.342       1.569       1.761
       48       1.297       1.524       1.635
       56       1.324       1.558       1.717
       64       1.367       1.618       1.810
       72       1.289       1.567       1.730
       80       1.390       1.676       1.866
       88       1.288       1.571       1.744
       96       1.331       1.596       1.806
      104       1.369       1.641       1.825
      112       1.314       1.594       1.784
      120       1.336       1.608       1.815
      128       1.370       1.655       1.842
      136       1.365       1.624       1.821
      144       1.306       1.596       1.776
      152       1.359       1.662       1.842
      160       1.306       1.592       1.806
      168       1.391       1.667       1.879
      176       1.359       1.609       1.832



The speedup data shown in Table 6.2 is derived from these execution times. 
Speedup was calculated using the usual formula (see Appendix A for details) 

$$ S_p = \frac{T_1}{T_p} $$

for speedup on p processors. 
TABLE 6.3: EFFICIENCIES FOR GF(PC) ON THE iPSC/2 

Dimension   Efficiency (percent) on a Hypercube of Order
      (n)           1           2           3
---------   ---------   ---------   ---------
        8      64.948      34.332      10.161
       16      53.155      29.441      12.024
       24      63.068      34.169      17.603
       32      62.451      35.716      18.370
       40      67.122      39.215      22.015
       48      64.852      38.098      20.431
       56      66.225      38.943      21.462
       64      68.354      40.450      22.625
       72      64.470      39.168      21.621
       80      69.498      41.905      23.323
       88      64.405      39.263      21.802
       96      66.548      39.903      22.570
      104      68.444      41.025      22.816
      112      65.695      39.847      22.304
      120      66.781      40.200      22.685
      128      68.492      41.385      23.022
      136      68.246      40.609      22.762
      144      65.312      39.909      22.203
      152      67.927      41.561      23.020
      160      65.303      39.797      22.574
      168      69.571      41.667      23.489
      176      67.931      40.223      22.898



Given the execution times and speedups presented in Tables 6.1 and 6.2, and using 
the formula 

$$ E_p = \frac{S_p}{p} $$

(as defined in Appendix A), we can determine the efficiency of p processors applied 
to the Gauss problem. This efficiency data is shown in Table 6.3. 
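
As a worked check, the two formulas can be written as a pair of small C functions. The function names are hypothetical, and the multiplication by 100 reflects the fact that the tables in this chapter report efficiency as a percentage.

```c
#include <assert.h>

/* Speedup on p processors: S_p = T_1 / T_p. */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Efficiency in percent: E_p = 100 * S_p / p. */
double efficiency_percent(double t1, double tp, int p)
{
    assert(p > 0);
    return 100.0 * speedup(t1, tp) / (double)p;
}
```

Using the first row of Table 6.1 (n = 8), a cube of order one has p = 2 processors, so `speedup(0.126, 0.097)` is about 1.299 and `efficiency_percent(0.126, 0.097, 2)` is about 64.95, which agrees with the corresponding entries of Tables 6.2 and 6.3 to the precision of the printed times.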
Figure 6.1: Efficiencies for GF(PC) on the iPSC/2 

Many different graphical displays of this data would be interesting, but the efficiency 
data may be the most telling since it captures the success or failure of a 
parallel program (i.e., poor efficiencies should lead us to question the parallel nature 
of the algorithm). Figure 6.1 shows a scatterplot of the data from Table 6.3. 
TABLE 6.4: EXECUTION TIMES FOR GF(PC) ON THE TRANSPUTERS 

Dimension                Time (seconds) on a Hypercube of Order
      (n)           0           1           2           3           4
---------   ---------   ---------   ---------   ---------   ---------
        8      0.0083      0.0075      0.0077      0.0088      0.0925
       16      0.0481      0.0392      0.0373      0.0372      0.1236
       24      0.1494      0.1173      0.1063      0.1001      0.1855
       32      0.3417      0.2580      0.2220      0.2132      0.2947
       40      0.6538      0.4922      0.4135      0.3798      0.4587
       48      1.1158      0.8202      0.6934      0.6397      0.7041
       56                  1.2950      1.0716      0.9696      1.0239
       64                  1.8940      1.5688      1.4046      1.4407
       72                              2.2116      1.9817      1.9808
       80                              2.9560      2.6529      2.6248
       88                              3.9127      3.4812      3.4090
       96                                          4.4808      4.3812
      104                                          5.6442      5.4519
      112                                          7.0388      6.7087
      120                                          8.5430      8.1252
      128                                         10.3300      9.7532
      136                                                     11.6930
      144                                                     13.6538
      152                                                     16.1029
      160                                                     18.5476
      168                                                     21.4437
      176                                                     24.4684
---------   ---------   ---------   ---------   ---------   ---------
    n_max          48          67          92         128         176




2. Data for the Transputer System 



Using the same methods, the timing (Table 6.4), speedup (Table 6.5), and 
efficiency (Table 6.6) data for the transputer system were determined. Unfortunately, 
the memory limitations of the transputers used for this work prevented comparisons 
for large problem sizes. Empty portions of Table 6.4 signify unavailability of data (i.e., 
execution failure due to inappropriate or excessive problem size). The maximum 
problem size that executed successfully for each configuration is listed on the last 
line of the table. Figure 6.2 shows a scatterplot of the data from Table 6.6. 
TABLE 6.5: SPEEDUPS FOR GF(PC) ON THE TRANSPUTERS 

Dimension            Speedup on a Hypercube of Order
      (n)           1           2           3           4
---------   ---------   ---------   ---------   ---------
        8       1.111       1.074       0.942       0.090
       16       1.227       1.288       1.290       0.389
       24       1.274       1.405       1.493       0.805
       32       1.324       1.539       1.602       1.159
       40       1.328       1.581       1.721       1.425
       48       1.360       1.609       1.744       1.585
       56       1.363       1.648       1.821       1.724
       64       1.389       1.677       1.872       1.826
       72                   1.691       1.887       1.888
       80                   1.734       1.932       1.953
       88                   1.743       1.959       2.001
       96                               1.975       2.020
      104                               1.993       2.064
      112                               1.996       2.094
      120                               2.022       2.126
      128                               2.030       2.150
      136                                           2.150
      144                                           2.186
      152                                           2.180
      160                                           2.207
      168                                           2.210
      176                                           2.227
TABLE 6.6: EFFICIENCIES FOR GF(PC) ON THE TRANSPUTERS 

Dimension       Efficiency (percent) on a Hypercube of Order
      (n)           1           2           3           4
---------   ---------   ---------   ---------   ---------
        8      55.556      26.860      11.775       1.125
       16      61.356      32.204      16.130       2.431
       24      63.693      35.133      18.662       5.034
       32      66.224      38.477      20.029       7.246
       40      66.409      39.526      21.514       8.908
       48      68.017      40.230      21.803       9.905
       56      68.167      41.190      22.760      10.776
       64      69.431      41.913      23.406      11.410
       72                  42.279      23.592      11.801
       80                  43.358      24.155      12.207
       88                  43.575      24.488      12.504
       96                              24.691      12.626
      104                              24.916      12.897
      112                              24.948      13.088
      120                              25.279      13.289
      128                              25.369      13.435
      136                                          13.440
      144                                          13.662
      152                                          13.623
      160                                          13.795
      168                                          13.812
      176                                          13.917
Figure 6.2: Efficiencies for GF(PC) on Transputers 

B. GAUSS WITH PARTIAL PIVOTING 



1. Data for the iPSC/2 System 



Table 6.7 shows the timing data for execution of the Gauss Factorization 
(partial pivoting) codes (gfpphost.c and gfppnode.c) on the Intel iPSC/2 system. 
The speedup data shown in Table 6.8 is derived from these execution times. 
Speedup was calculated using the usual formula (see Appendix A for details) 

$$ S_p = \frac{T_1}{T_p} $$

for speedup on p processors. Given the execution times and speedups presented in 
Tables 6.7 and 6.8, and using the formula 

$$ E_p = \frac{S_p}{p} $$

(as defined in Appendix A), we can determine the effectiveness (efficiency) of p 
processors applied to the Gauss problem. This efficiency data is shown in Table 6.9. 
TABLE 6.7: EXECUTION TIMES FOR GF(PP) ON THE iPSC/2 

Dimension          Time (seconds) on a Hypercube of Order
      (n)           0           1           2           3
---------   ---------   ---------   ---------   ---------
        8       0.109       0.130       0.127       0.155
       16       0.371       0.359       0.394       0.493
       24       0.508       0.489       0.519       0.624
       32       0.752       0.673       0.675       0.782
       40       1.055       0.880       0.834       0.911
       48       1.499       1.144       1.024       1.067
       56       2.019       1.473       1.248       1.228
       64       2.733       1.878       1.491       1.402
       72       3.646       2.412       1.872       1.721
       80       4.743       3.040       2.256       1.989
       88       6.053       3.719       2.644       2.237
       96       7.567       4.547       3.125       2.560
      104       9.431       5.477       3.698       2.912
      112      11.468       6.561       4.252       3.237
      120      13.847       7.859       4.933       3.646
      128      16.552       9.211       5.661       4.070
      136      19.619      10.873       6.590       4.633
      144      23.071      12.632       7.532       5.170
      152      26.982      14.681       8.940       5.866
      160      31.204      16.869       9.866       6.539
      168      35.865      19.318      11.143       7.284
      176      41.064      21.990      12.605       8.084
      200      59.453      31.437      17.598      10.910
      225      83.962      44.076      24.329      14.701
      250     114.319      59.515      32.410      19.118
      275     151.443      78.652      42.336      24.512
      300     195.822     102.589      54.138      30.927
      325     248.153     127.840      68.082      38.418
      350     309.241     158.859      84.072      46.978
      375     379.538     194.599     101.984      56.280
      400     459.740     235.259     122.946      67.366
      425     550.536     281.312     147.058      80.439
      450     653.070     333.180     173.748      94.656
      475     767.616     391.136     203.513     110.243
      500     894.705     455.308     236.483     127.631
TABLE 6.8: SPEEDUPS FOR GF(PP) ON THE iPSC/2 

Dimension      Speedup on a Hypercube of Order
      (n)           1           2           3
---------   ---------   ---------   ---------
        8       0.842       0.860       0.704
       16       1.035       0.941       0.753
       24       1.039       0.979       0.814
       32       1.118       1.114       0.961
       40       1.199       1.265       1.158
       48       1.311       1.465       1.405
       56       1.371       1.618       1.645
       64       1.455       1.833       1.949
       72       1.512       1.948       2.119
       80       1.560       2.102       2.384
       88       1.628       2.289       2.706
       96       1.664       2.422       2.956
      104       1.722       2.550       3.239
      112       1.748       2.697       3.543
      120       1.762       2.807       3.798
      128       1.797       2.924       4.067
      136       1.804       2.977       4.235
      144       1.826       3.063       4.462
      152       1.838       3.018       4.600
      160       1.850       3.163       4.772
      168       1.857       3.219       4.924
      176       1.867       3.258       5.080
      200       1.891       3.378       5.449
      225       1.905       3.451       5.711
      250       1.921       3.527       5.980
      275       1.925       3.577       6.178
      300       1.909       3.617       6.332
      325       1.941       3.645       6.459
      350       1.947       3.678       6.583
      375       1.950       3.722       6.744
      400       1.954       3.739       6.825
      425       1.957       3.744       6.844
      450       1.960       3.759       6.899
      475       1.963       3.772       6.963
      500       1.965       3.783       7.010
TABLE 6.9: EFFICIENCIES FOR GF(PP) ON THE iPSC/2 

Dimension   Efficiency (percent) on a Hypercube of Order
      (n)           1           2           3
---------   ---------   ---------   ---------
        8      42.085      21.499       8.803
       16      51.743      23.526       9.416
       24      51.943      24.470      10.174
       32      55.911      27.842      12.019
       40      59.943      31.615      14.472
       48      65.544      36.615      17.563
       56      68.557      40.453      20.560
       64      72.764      45.825      24.365
       72      75.580      48.698      26.482
       80      78.023      52.554      29.804
       88      81.390      57.228      33.821
       96      83.218      60.541      36.955
      104      86.104      63.762      40.482
      112      87.402      67.427      44.287
      120      88.096      70.175      47.475
      128      89.849      73.097      50.832
      136      90.219      74.430      52.934
      144      91.323      76.577      55.781
      152      91.897      75.451      57.497
      160      92.492      79.072      59.651
      168      92.830      80.469      61.544
      176      93.372      81.442      63.498
      200      94.559      84.462      68.115
      225      95.247      86.278      71.393
      250      96.042      88.181      74.744
      275      96.274      89.430      77.230
      300      95.440      90.427      79.147
      325      97.056      91.123      80.742
      350      97.332      91.958      82.283
      375      97.518      93.039      84.297
      400      97.709      93.484      85.307
      425      97.851      93.591      85.552
      450      98.006      93.968      86.243
      475      98.127      94.296      87.037
      500      98.253      94.584      87.626
Figure 6.3: Efficiencies for GF(PP) on the iPSC/2 

Here, again, only the efficiency is plotted. Figure 6.3 shows a scatterplot of the data 
from Table 6.9. 
2. Data for the Transputer System 

Using the same methods, the timing (Table 6.10), speedup (Table 6.11), and 
efficiency (Table 6.12) data for the transputer system were determined. Unfortunately, 
the memory limitations of the transputers (32 kilobytes per node) used for this 
work prevented comparisons for large (interesting) problem sizes. Empty portions of 
Table 6.10 signify unavailability of data (i.e., execution failure due to inappropriate 
or excessive problem size). The maximum problem size that executed successfully 
for each configuration is listed on the last line of Table 6.10. The minimum problem 
size for the hybrid cube on 16 processors was one where the dimension of A was 
n = 16. 
TABLE 6.10: EXECUTION TIMES FOR GF(PP) ON THE TRANSPUTERS 

Dimension                Time (seconds) on a Hypercube of Order
      (n)           0           1           2           3           4
---------   ---------   ---------   ---------   ---------   ---------
        8      0.0906      0.0904      0.0906      0.0909
       16      0.1126      0.1101      0.1102      0.1107      0.1092
       24      0.1582      0.1480      0.1462      0.1461      0.1439
       32      0.2312      0.2038      0.1965      0.1952      0.1889
       40      0.3360      0.2765      0.2568      0.2520      0.2446
       48                  0.3782      0.3402      0.3297      0.3149
       56                  0.5124      0.4463      0.4258      0.4064
       64                  0.6911      0.5863      0.5505      0.5196
       72                              0.7277      0.6715      0.6308
       80                              0.8976      0.8147      0.7560
       88                              1.0675      0.9482      0.8732
       96                                          1.1584      1.0581
      104                                          1.3657      1.2430
      112                                          1.6129      1.4551
      120                                          1.8388      1.6490
      128                                                      1.8585
      136                                                      2.1306
      144                                                      2.3606
      152                                                      2.6717
      160                                                      2.9846
      168                                                      3.2910
      176                                                      3.6606
---------   ---------   ---------   ---------   ---------   ---------
    n_max          47          66          92         127         176
TABLE 6.11: SPEEDUPS FOR GF(PP) ON THE TRANSPUTERS 

Dimension            Speedup on a Hypercube of Order
      (n)           1           2           3           4
---------   ---------   ---------   ---------   ---------
        8       1.002       1.000       0.997
       16       1.023       1.022       1.017       1.031
       24       1.069       1.082       1.083       1.099
       32       1.134       1.177       1.184       1.224
       40       1.215       1.308       1.333       1.374
       48       1.302       1.447       1.493       1.563
       56       1.387       1.592       1.669       1.748
       64       1.448       1.707       1.818       1.926
       72                   1.888       2.046       2.178
       80                   2.049       2.258       2.433
       88                   2.256       2.539       2.758
       96                               2.667       2.920
      104                               2.853       3.134
      112                               2.998       3.323
      120                               3.219       3.590
      128                                           3.852
      136                                           4.019
      144                                           4.296
      152                                           4.456
      160                                           4.646
      168                                           4.871
      176                                           5.031
TABLE 6.12: EFFICIENCIES FOR GF(PP) ON THE TRANSPUTERS 

Dimension       Efficiency (percent) on a Hypercube of Order
      (n)           1           2           3           4
---------   ---------   ---------   ---------   ---------
        8      50.111      25.000      12.459
       16      51.135      25.544      12.715       6.445
       24      53.446      27.052      13.535       6.871
       32      56.722      29.415      14.805       7.650
       40      60.759      32.710      16.667       8.585
       48      65.090      36.180      18.666       9.772
       56      69.334      39.801      20.859      10.927
       64      72.412      42.678      22.727      12.039
       72                  47.193      25.571      13.611
       80                  51.228      28.220      15.206
       88                  56.392      31.744      17.235
       96                              33.343      18.252
      104                              35.657      19.589
      112                              37.475      20.770
      120                              40.241      22.436
      128                                          24.073
      136                                          25.116
      144                                          26.849
      152                                          27.850
      160                                          29.036
      168                                          30.447
      176                                          31.441
Figure 6.4: Efficiencies for GF(PP) on Transputers 

Figure 6.4 shows a scatterplot of the data from Table 6.12. 
VII. CONCLUSIONS 

I value the discovery of a single even insignificant truth more highly than all 
the argumentation on the highest questions which fails to reach a truth. 

— GALILEO (1564-1642) 

A. SIGNIFICANCE OF THE RESULTS 
1. Communications and Computation 

Perhaps one of the most obvious effects that can be noticed in the results 
of Chapter VI is the abysmal performance of the complete pivoting code when com- 
pared to the partial pivoting implementation. The relatively small amount of extra 
communications required for the complete pivoting algorithm seems to force syn- 
chronization delays, thus reducing the system’s performance. This demonstrates the 
criticality of balancing communications with calculation in parallel processing. The 
conclusion, for this problem, is that parallel designs must minimize the frequency of 
synchronizing events and minimize the communications volume on occasions when 
communication is necessary. The greater the amount of uninterrupted work that a 
processor can accomplish, the better. While control (i.e., blocking communications, 
synchronization, and loop-by-loop data distribution) is necessary, it will have adverse 
impacts on performance. The individual processors of a multiprocessor system should 
be granted the maximum degree of independence that the mission will allow. 

While there is undoubtedly some room for improvement in the complete 
pivoting code, it would appear that maximum efficiencies of approximately 22%, 
40%, and 70% for hypercubes of order three, two, and one, respectively, are likely on 
the iPSC/2. The same code seems to be headed for somewhat better performance 
on the transputers, but with the shortage of memory, it is difficult to extrapolate 
and determine the direction of the plots. The higher order cubes appear to flatten 
at about the same efficiency that the iPSC/2 showed as a terminal efficiency. 

The partial pivoting code, on the other hand, exhibits the kind of characteristics 
that we like to see in parallel code. Both systems show efficiencies rising 
sharply (again, the size limit for the transputers is unfortunate) and the iPSC/2 
shows some very nice results as the dimension of the matrix exceeds about 250. 

B. THE TERAFLOP RACE 

One of the biggest challenges to parallel computing today can be found in the 
“teraflop race”. There are at least three competitors with teraflop initiatives: the 
United States, Europe, and Japan. The United States effort centers around Intel 
with projects like Touchstone (Chapter I). The European effort relies on the T9000 
transputer. Considering the three to five year old technology used for this research, 
together with the numbers that the various parallel computer designers boast today, 
it seems that we might see teraflop performance by the mid-1990s. C. Gordon Bell 
claims that the teraflop is conceivable [Ref. 6: p. 1099]: 

Two relatively simple and sure paths exist for building a system that could 
deliver on the order of 1 teraflop by 1995. They are: (1) A 4K node multicomputer 
with 800 gigaflops peak or a 32K node multicomputer with 1.5 teraflops. (2) A 
Connection Machine with more than one teraflop and several million processing 
elements. 

Current products suggest that INMOS and Intel will be among the most likely 
competitors. Table 7.1, adapted from Jack Dongarra’s report [Ref. 8: p. 20], shows 
how transputer-based systems compare to Intel products. This table summarizes a 
test involving the solution for a 1000 x 1000 system of linear equations. The processors 
used for my thesis show floating-point capabilities of 0.37 Mflops (T800-20) and 
0.16 Mflops (Compaq 386/20 with 80387) in Dongarra’s report [Ref. 8: pp. 14, 16]. 
TABLE 7.1: PARALLEL MACHINE COMPARISON 

Computer            t_1      p      t_p    Speedup   Efficiency
---------------   -----   ----   ------   --------   ----------
Parsytec FT-400    1075    400     4.90      219.0          .55
Parsytec FT-400    1075    256     6.59      163.0          .64
Parsytec FT-400    1075    100    13.20       81.4          .81
Parsytec FT-400    1075     64    19.10       56.3          .88
Parsytec FT-400    1075     16    69.20       15.5          .97
Intel iPSC/860       59     32     5.30       11.0          .34
Intel iPSC/860       59     16     6.80        8.7          .54
Intel iPSC/860       59      8    10.60        5.6          .70



The iPSC/860 illustrates the most recent technology and shows excellent uniprocessor 
performance (6.5 Mflops) [Ref. 8: p. 9]. The T800 transputer that Parsytec 
used is somewhat dated and will soon be replaced by the T9000. Nevertheless, the 
transputer-based system shows good parallel performance. The times of execution in 
the experiments of this thesis also indicate that the T800 is faster for floating-point 
calculations than the 386/387 combination in the iPSC/2. 



C. FURTHER WORK 



My research suggests many areas for further investigation. The method of 
conjugate gradients shows a great deal of promise as a candidate for parallelization. 
Indeed, it was the original aim of this thesis, but the development of other portions of 
the code required a great deal of time. The parallel CG algorithm should be relatively 
simple to code and holds great potential with respect to performance. Additionally, 
it possesses a nontrivial derivation and the theory behind the algorithm would be 
interesting to develop. 

There are many other variations on Gauss factorization that could be coded 
and tested. While the programs presented in this thesis are designed in an effort 
to produce efficient performance, there is undoubtedly much that might be done to 
enhance this code. Among the options: at a very basic level, we could begin with 
other distributions of the matrix A. A block method or row method may actually 
yield better performance. As the LINPACK benchmarks seem to use blocks, this is 
probably worth pursuing. 

General purpose parallel computing, the ability to rely on parallel architectures 
for general purpose computation without the investigator needing to be more concerned 
with the architecture than with the problem being computed, still requires much 
work. The ability to use parallel architectures as a computational tool to solve 
problems will mark an increasing maturity in this field. 

Applying object-oriented design and programming paradigms to the parallel 
world may hold a great deal of promise. In particular, the C++ language seems to 
be a prudent choice for parallel programming. 

In addition to the more practical options, the study of parallel theory and al- 
gorithms seems interesting and shows a great need for development. In particular, 
this field seems to need a more-or-less general (at least for MIMD machines) ap- 
proach to classifying parallel algorithms and specifying their performance. As noted 
in Chapter IV, a mixture of this field with graph theory may hold a great deal of 
promise. 

On an initial glance, the use of the Ada programming language with its inbuilt 
tasking constructs might seem optimum for the type of computing investigated in 
this thesis. Ada, in this regard, however, is optimized for use with shared memory 
multiprocessors. The use of Ada on transputers still requires much experimentation 
and better tools. Presently only one, rather expensive, Ada compiler is available for 
transputer use. Its required use of occam harnesses makes using Ada on transputers 
awkward at best. Further research is needed to create a better environment for Ada 
programming on transputers. Given the significance of Ada to the DoD establish- 
ment, this should become a priority. The inclusion of a standard math package and 
the advent of Ada 9X may hold some promise in this regard. 






APPENDIX A 



NOTATION AND TERMINOLOGY 



This appendix explains the shorthand used in the rest of the thesis. Conventions, 
by definition, are generally accepted rules of the business. This would 
seem to obviate the need for further discussion of conventions, but there are sev- 
eral good reasons for discussing notation and terminology. First, the notation may 
not be conventional. In the absence of convention (or when the foundation that it 
provides is inadequate) a more substantial agreement is required. Second, even for 
conventional notation, the audience may be diverse enough to warrant familiariza- 
tion. The following discussion provides this familiarity and gives the terms of an 
agreement to establish the meaning of the words and symbols used in the rest of 
the work. On occasion, neither convention nor this agreement will suffice. These 
situations will be handled case-by-case with the philosophy that clarity should 
never be sacrificed for brevity. 

A. BASICS 



Most of the work deals with the integers, Z (from the German word for numbers, 
Zahlen), the set of real numbers, R, and the complex numbers, C. Often, the 
German (Fraktur) letter $\Re$ is used to represent the reals. A complex number is a 
number, $z = x + iy \in C$, that has a real part ($x \in \Re$) and an imaginary part 
($y \in \Re$), with the complex unit $i = \sqrt{-1}$. Sometimes the real part is denoted 
Re(z) and Im(z) is used to represent the imaginary part. 

A scalar is simply a real number, and is usually denoted by a lower-case Greek 
letter.¹ A vector is an ordered set of scalars. Lower-case Latin letters like b, x, and 
y are used to denote vectors. Sometimes an arrow is placed above the name of a 
vector, like $\vec{x}$, to emphasize the fact that it is a vector. 

¹ The Greek alphabet is shown in the Table of Symbols. 
Matrices are two dimensional and usually contain real or complex elements. 
Capital letters (Greek or Latin) are used to represent matrices. Common examples 
include A, Γ, Q, R, Λ, and Σ. 

The number systems introduced above cannot be represented in a finite space. 
There are two basic problems. First, we should consider the size (or cardinality) of 
the sets. The integers are countable or denumerable since there exists a one-to-one 
mapping between Z and the natural numbers, N. This is an advantage in finite 
storage since it means that we can choose a finite range of the integers and be quite 
certain that every integer in that range is represented (exactly). Even though Z is 
denumerable, it is a set with infinite cardinality. 

The real numbers present a more difficult situation for finite storage. The real 
number line is dense in comparison to the integers. $\Re$ is not only an infinite set, it is 
not countable (i.e., $\Re$ is uncountable). It is said to have the power of the continuum. 
To represent a real number, x, we use the floating-point approximation, fl(x), to x. 
This is a number that may be described by three parts: the sign s, the exponent e, 
and the mantissa d. An illustration of such a number is provided in Chapter II. 
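
The three parts can be inspected directly in C. The sketch below uses the standard library functions `frexp` and `ldexp`, which normalize the mantissa to the interval [0.5, 1); the illustration in Chapter II may use a different normalization. The type `fl_parts` and the function names are hypothetical, and finite inputs are assumed.

```c
#include <assert.h>
#include <math.h>

/* Sign, exponent, and mantissa of a finite double. */
typedef struct {
    int    sign;       /* 0 for non-negative, 1 for negative   */
    int    exponent;   /* power of two                          */
    double mantissa;   /* magnitude in [0.5, 1), or 0 for zero  */
} fl_parts;

/* Split x so that |x| = mantissa * 2^exponent. */
fl_parts fl_split(double x)
{
    fl_parts p;
    assert(isfinite(x));
    p.sign = signbit(x) ? 1 : 0;
    p.mantissa = frexp(fabs(x), &p.exponent);
    return p;
}

/* Reassemble the value from its three parts. */
double fl_join(fl_parts p)
{
    double v = ldexp(p.mantissa, p.exponent);
    return p.sign ? -v : v;
}
```

For example, `fl_split(-6.0)` yields sign 1, exponent 3, and mantissa 0.75, since 6 = 0.75 x 2^3. Splitting and rejoining returns the stored value unchanged, although that stored value is itself only fl(x) for a real x such as 0.1.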

B. COMPLEX NUMBERS 
1. Notation 

The previous section introduced one notation for complex numbers; namely, 
z = x + iy. There are several other representations, each of which makes its own 
contribution in practical use. Electrical engineers usually replace the i with j since i 
is used to represent electrical current. Since the complex number can be represented 
by an ordered pair of real numbers, the graphical notation of Figure A.1 is natural. 
In this plane, the real and imaginary axes are used to represent the components of 
a complex number. 
Figure A.1: The Complex Plane 

The vector sum of these two parts, $\vec{z} = \vec{x} + \vec{y}$, is an equivalent and useful 
way to model complex numbers. There is yet another way to describe z. Let r be the 
magnitude of the vector z and let $\theta$ be the angle measured from the positive real axis 
counter-clockwise to z. Using this notation, we could use trigonometry to describe 
the complex number as $z = r(\cos\theta + i\sin\theta)$. The Euler formula [Ref. 32: p. 74], 

$$ e^{z} = e^{x+iy} = e^{x} e^{iy} = e^{x}(\cos y + i \sin y), \qquad \text{(A.1)} $$

can be used to convert a complex number to yet another form: $z = re^{i\theta}$. 
2. Operations 

a. Addition and Subtraction 

Addition and subtraction of complex numbers is performed in the same 
manner that vectors are added or subtracted. For instance, let $z_1 = a + ib$ and let 
$z_2 = c - id$. Then the sum, $z_1 + z_2$, is the same as the sum of the corresponding 
vectors: 

$$ z_1 + z_2 = \begin{bmatrix} a + c \\ b - d \end{bmatrix} \qquad \text{(A.2)} $$

so the sum is $z_1 + z_2 = (a + c) + i(b - d)$. Differences are handled in the obvious way, 
as vector differences. 



b. Multiplication 

Multiplication is performed by applying high school algebra. For the 
same complex numbers $z_1$ and $z_2$: 

$$ z_1 \times z_2 = (a + ib)(c - id) = ac - (a)(id) + (ib)(c) - (ib)(id) \qquad \text{(A.3)} $$

and using the definition of the complex unit, $i = \sqrt{-1}$, we may combine the middle 
terms and move the $i^2 = -1$ outside the last term to find the (complex) product: 

$$ z_1 \times z_2 = ac - i(ad - bc) + bd = (ac + bd) - i(ad - bc) \qquad \text{(A.4)} $$
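
The sum and product rules translate directly into code. The sketch below uses a hypothetical `cplx` type and hypothetical function names rather than the thesis's own complex.h; note that the example number z_2 = c - id is stored with imaginary part -d.

```c
#include <assert.h>
#include <math.h>

typedef struct { double re, im; } cplx;   /* hypothetical type for this sketch */

/* Componentwise sum, as for vectors. */
cplx c_add(cplx u, cplx v)
{
    cplx w;
    w.re = u.re + v.re;
    w.im = u.im + v.im;
    return w;
}

/* Product by expanding (u.re + i u.im)(v.re + i v.im) with i^2 = -1. */
cplx c_mul(cplx u, cplx v)
{
    cplx w;
    w.re = u.re * v.re - u.im * v.im;
    w.im = u.re * v.im + u.im * v.re;
    return w;
}

/* Nonzero when u and v agree to within tol in both parts. */
int c_equal(cplx u, cplx v, double tol)
{
    assert(tol >= 0.0);
    return fabs(u.re - v.re) <= tol && fabs(u.im - v.im) <= tol;
}
```

With a = 1, b = 2, c = 3, and d = 4, the sum is 4 - i2 and the product is (ac + bd) - i(ad - bc) = 11 + i2, matching the vector sum and the product formula above.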

c. Conjugation 

The complex conjugate of a complex number $z = x + iy$ is defined as 
$\bar{z} = x - iy$. This simple operation finds practical application in complex division. 



d. Division 

Consider the quotient ($z_1/z_2$) of the same complex numbers that were 
used in equations A.2, A.3, and A.4. If we multiply both the numerator and the 
denominator by the complex conjugate of the denominator, $\bar{z}_2$, we have: 

$$ \frac{z_1}{z_2} = \frac{a + ib}{c - id} = \frac{(a + ib)(c + id)}{(c - id)(c + id)} = \frac{ac + i(ad) + i(bc) + i^2(bd)}{c^2 - i^2 d^2} \qquad \text{(A.5)} $$

and then, by applying $i^2 = -1$, we conclude: 

$$ \frac{z_1}{z_2} = \frac{ac - bd + i(bc + ad)}{c^2 + d^2} = \frac{(ac - bd)}{(c^2 + d^2)} + i\,\frac{(bc + ad)}{(c^2 + d^2)} \qquad \text{(A.6)} $$



As a practical matter, this is not the way we would compute a complex quotient. 
The code given in Appendix F (function cdiv() in complex.h) provides a method 
that is better suited to the finite precision environment. 
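
The thesis's cdiv() is not reproduced here. As an assumed illustration of what "better suited to finite precision" can mean, the sketch below uses Smith's method, a well-known alternative that scales by the larger component of the divisor so that the sum of squares in the denominator is never formed. The type `cplx` and the name `c_div` are hypothetical.

```c
#include <assert.h>
#include <math.h>

typedef struct { double re, im; } cplx;   /* hypothetical type for this sketch */

/* Quotient u / v by Smith's method: divide through by the larger of
   |v.re| and |v.im| so that v.re^2 + v.im^2 is never computed. */
cplx c_div(cplx u, cplx v)
{
    cplx w;
    double ratio, denom;

    assert(v.re != 0.0 || v.im != 0.0);   /* division by zero is not handled */
    if (fabs(v.re) >= fabs(v.im)) {
        ratio = v.im / v.re;
        denom = v.re + v.im * ratio;
        w.re = (u.re + u.im * ratio) / denom;
        w.im = (u.im - u.re * ratio) / denom;
    } else {
        ratio = v.re / v.im;
        denom = v.re * ratio + v.im;
        w.re = (u.re * ratio + u.im) / denom;
        w.im = (u.im * ratio - u.re) / denom;
    }
    return w;
}
```

For (1 + i2)/(3 - i4) this gives -0.2 + i0.4, the same value as the textbook formula. Unlike that formula, it also handles a case such as (1e200 + i1e200)/(1e200 + i1e200), where squaring the divisor's components would overflow a double.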



C. VECTORS AND MATRICES 
1. Columns and Rows 



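Vectors are ordered collections of scalars represented as columns. Let 
$\alpha, \beta, \gamma \in C$ with $\alpha = 1.0 + i4.0$, $\beta = 2.0 - i5.0$, and $\gamma = 3.0 + i6.0$. Then: 

$$ x = \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 \\ 2.0 - i5.0 \\ 3.0 + i6.0 \end{bmatrix} $$

If row-orientation is intended the transpose is used: 

$$ x^T = [\, \alpha \;\; \beta \;\; \gamma \,] = [\, (1.0 + i4.0) \;\; (2.0 - i5.0) \;\; (3.0 + i6.0) \,] $$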



Matrices may be formed as ordered combinations of elements, vectors, or blocks. 
Suppose that p = 3.0 and u = 7.0. Then, with x as given above, the following 
matrices are equivalent: 



A=[ 



x px vx 



1.0 + z4.0 3.0 + *12.0 7.0 + z'2S.O 

2.0 — £5.0 6.0 - z 15.0 14.0 — *35.0 

3.0 + z6.0 9.0 + zlS.O 21.0 + i42.0 



(A. 7) 



An element within a matrix is usually denoted $A(i,j)$, where $i$ is the row index and
$j$ is the column index. For instance, $A(1,3) = 7.0 + i28.0$ in (A.7).

A block of the matrix $A$ is a rectangular matrix $B$ within $A$. MATLAB
notation is useful. For instance, $B = A(i:j, k:l)$ means that $B$ is the block of $A$’s
rows $i$ through $j$ and columns $k$ through $l$. The row or column “:” means all rows or
all columns. For instance:

$$ B = A(:, 1:2) = \begin{bmatrix} 1.0 + i4.0 & 3.0 + i12.0 \\ 2.0 - i5.0 & 6.0 - i15.0 \\ 3.0 + i6.0 & 9.0 + i18.0 \end{bmatrix} \tag{A.8} $$



As a sidenote, a number with a decimal point should usually be taken as 
a real number. Mathematically speaking, 1 = 1.0. But many compilers treat 1 
as an integer and use the decimal point to recognize 1.0 as a floating-point value. 
Therefore, all of the code associated with this work and most of the examples use 
the decimal point as a clue that the number is a real number or its floating-point 
approximation. 

2. Conjugation and Transposition 

The conjugate of a vector or matrix is simply a vector or matrix whose
entries are the conjugates of the original entries. A superscript C is used to denote
the conjugate of a vector or matrix. For instance, with $A$ as given in (A.7),

$$ A^C = \begin{bmatrix} 1.0 - i4.0 & 3.0 - i12.0 & 7.0 - i28.0 \\ 2.0 + i5.0 & 6.0 + i15.0 & 14.0 + i35.0 \\ 3.0 - i6.0 & 9.0 - i18.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.9} $$



The transpose of a vector or matrix, denoted with a superscript T, refers to
a transposition of its rows and columns. With $A \in \mathbb{C}^{m \times n}$, the effect of transposition
is that $A(i,j) = A^T(j,i)$ for all $i$ such that $1 \le i \le m$, and all $j$ so that $1 \le j \le n$.
For example, consider the transposition of the matrix $A$ that is found in equation A.7.

$$ A^T = \begin{bmatrix} x^T \\ \mu x^T \\ \nu x^T \end{bmatrix} = \begin{bmatrix} 1.0 + i4.0 & 2.0 - i5.0 & 3.0 + i6.0 \\ 3.0 + i12.0 & 6.0 - i15.0 & 9.0 + i18.0 \\ 7.0 + i28.0 & 14.0 - i35.0 & 21.0 + i42.0 \end{bmatrix} \tag{A.10} $$

In this example we see that the columns of a matrix become the rows of its transpose.
This example also demonstrates that when we first transpose, and then stack the
columns of a matrix, we arrive at the transpose of the matrix. In the event that
$A = A^T$, we say that $A$ is symmetric.

The conjugate (or Hermitian) transpose of $A$ is $A^H$. This matrix is the
result of combining the conjugation and transposition operations on $A$. The following
example shows the Hermitian transpose of $A$:

$$ A^H = \begin{bmatrix} 1.0 - i4.0 & 2.0 + i5.0 & 3.0 - i6.0 \\ 3.0 - i12.0 & 6.0 + i15.0 & 9.0 - i18.0 \\ 7.0 - i28.0 & 14.0 + i35.0 & 21.0 - i42.0 \end{bmatrix} \tag{A.11} $$

If $A = A^H$, we say that “$A$ is Hermitian.” We should never confuse “$A$ is Hermitian”
with “$A$ Hermitian” (the conjugate transpose, $A^H$, of $A$). [Ref. 33: p. 294]
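
The two operations can be combined in a single pass over the matrix. The following C sketch forms the conjugate transpose and tests whether a square matrix is Hermitian. It assumes row-major storage in a flat array; the type `cplx` and both function names are hypothetical and are not taken from the thesis code.

```c
#include <stddef.h>

/* Hypothetical complex type for illustration. */
typedef struct { double re, im; } cplx;

/* Writes the n-by-m conjugate transpose of the m-by-n matrix a into ah.
   Both matrices are stored row-major in flat arrays. */
void hermitian_transpose(size_t m, size_t n, const cplx *a, cplx *ah)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            ah[j * m + i].re =  a[i * n + j].re;   /* transpose */
            ah[j * m + i].im = -a[i * n + j].im;   /* conjugate */
        }
    }
}

/* Returns 1 if the n-by-n matrix a equals its own conjugate transpose. */
int is_hermitian(size_t n, const cplx *a)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (a[i * n + j].re != a[j * n + i].re ||
                a[i * n + j].im != -a[j * n + i].im)
                return 0;
    return 1;
}
```

Applied to the matrix of (A.7), `hermitian_transpose` reproduces the entries of (A.11), and `is_hermitian` returns 0 because that matrix is not equal to its conjugate transpose.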



3. Zeros 



It could be argued that zero is the most important number. In addition to 
its use as a number, zero is also used to represent a vector or matrix in which every 
element is equal to zero. In the (extremely rare) event that the context does not 
clearly indicate the size of a “0-vector” or “0-matrix”, its size will be given explicitly. 
In the absence of implied or specified size, 0 should be interpreted as the number 
zero. Additionally, blank space within a matrix usually means that all elements in 
that region are zero. 

4. Special Forms 
a. Axis Vectors

An axis vector, $e_i$, is simply the $i$th column (or row) of the identity
matrix.

b. Lower Triangular

A lower triangular matrix, usually denoted $L$, has the form

$$ L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \end{bmatrix} \tag{A.12} $$

If $L$ has ones on the diagonal, it is called unit lower triangular. Similarly, the upper
triangular matrix $U$ has the form

$$ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \end{bmatrix} \tag{A.13} $$



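$U$ is called unit upper triangular if the diagonal elements are all ones. Sometimes
(e.g., Chapter III) such a matrix is called right triangular and denoted $R$. When the
matrix is not square, the lower and upper triangular ideas are translated to lower and
upper trapezoidal, with the unit trapezoidal matrices having ones on the diagonal.
The following matrices illustrate the different kinds of trapezoidal matrices. The
matrices may be tall and skinny as

$$ U = \begin{bmatrix} \times & \times & \times \\ & \times & \times \\ & & \times \\ & & \end{bmatrix}, \qquad L = \begin{bmatrix} \times & & \\ \times & \times & \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \tag{A.14} $$

or short and fat

$$ U = \begin{bmatrix} 1 & \times & \times & \times \\ & 1 & \times & \times \\ & & 1 & \times \end{bmatrix}, \qquad L = \begin{bmatrix} 1 & & & \\ \times & 1 & & \\ \times & \times & 1 & \end{bmatrix} \tag{A.15} $$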



D. NORMS 



The information below was taken from [Ref. 21: pp. 53-60], so it seems fitting
to begin with a few of Golub and Van Loan’s comments on norms.

Norms serve the same purpose on vector spaces that absolute value does on
the real line: they furnish a measure of distance. More precisely, $\mathbb{R}^n$ together with
a norm on $\mathbb{R}^n$ defines a metric space. Therefore, we have the familiar notions
of neighborhood, open sets, convergence, and continuity when working with vectors
and vector-valued functions.

1. Vector Norms 

a. Definition 

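A vector norm on $\mathbb{R}^n$ is a function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies the following
properties [Ref. 21: p. 53]:

$$ f(x) \ge 0, \quad x \in \mathbb{R}^n, \quad (f(x) = 0 \text{ iff } x = 0) \tag{A.16} $$

$$ f(x + y) \le f(x) + f(y), \quad x, y \in \mathbb{R}^n \tag{A.17} $$

$$ f(\alpha x) = |\alpha| \, f(x), \quad \alpha \in \mathbb{R}, \; x \in \mathbb{R}^n \tag{A.18} $$

We denote such a function with a double bar notation: $f(x) = \| x \|$.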



b. The p-Norm

Subscripts on the double bar are used to distinguish between various
norms. The most popular example of this is the p-norm, $\| \cdot \|_p$. This norm is
defined by [Ref. 21: p. 53]

$$ \| x \|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}, \qquad p \ge 1. \tag{A.19} $$

The 2-norm is the one used most frequently in this work, but the 1 and $\infty$-norms
find frequent application in other work. A natural representation of the 2-norm is
the square root of an inner product

$$ \| x \|_2 = \left( |x_1|^2 + \cdots + |x_n|^2 \right)^{1/2} = \sqrt{x^T x} \tag{A.20} $$

The 2-norm of $x$ is the Euclidean length of the vector $x$.
2. Matrix Norms



a. Definition

A matrix norm on $\mathbb{R}^{m \times n}$ is a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ that satisfies
properties similar to those presented in the vector case [Ref. 21: p. 56]:

$$ f(A) \ge 0, \quad A \in \mathbb{R}^{m \times n}, \quad (f(A) = 0 \text{ iff } A = 0) \tag{A.21} $$

$$ f(A + B) \le f(A) + f(B), \quad A, B \in \mathbb{R}^{m \times n} \tag{A.22} $$

$$ f(\alpha A) = |\alpha| \, f(A), \quad \alpha \in \mathbb{R}, \; A \in \mathbb{R}^{m \times n} \tag{A.23} $$

Matrix norms also use the double bar notation: $f(A) = \| A \|$. The Frobenius norm
and the p-norm are the most common matrix norms.

b. Frobenius

The Frobenius norm is defined as

$$ \| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 } \tag{A.24} $$

c. p-Norms

The p-norm of a matrix, $A$, is defined by

$$ \| A \|_p = \sup_{x \ne 0} \frac{\| A x \|_p}{\| x \|_p} \tag{A.25} $$

E. LINEAR SYSTEMS



One of the fundamental tasks of linear algebra is to form a matrix representation
of a system of linear equations. Consider the system of linear equations:

$$ \begin{aligned} 2u_1 + 3u_2 - 4u_3 &= 7 \\ 3u_1 - 5u_2 + 7u_3 &= 3 \\ 4u_1 + 6u_2 - 2u_3 &= 1 \end{aligned} \tag{A.26} $$

This system of equations can be expressed using the matrix notation $Au = b$

$$ \begin{bmatrix} 2 & 3 & -4 \\ 3 & -5 & 7 \\ 4 & 6 & -2 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 7 \\ 3 \\ 1 \end{bmatrix} \tag{A.27} $$



F. MEASURES OF COMPLEXITY 



The first, and most rudimentary requirement for an algorithm is that it produce 
the correct answer. This seems utterly obvious, but it must never be lost in the 
algorithm designer’s pursuit of the next most important elements — efficiency in using 
time and space. For the moment, we shall assume that the algorithm arrives at an
acceptable answer. Then the algorithm’s use of time and space becomes a very 
serious subject. Knuth provides the notation in [Ref. 34]. 

The time complexity of an algorithm, also known as running time, describes how 
the program works under a stopwatch. Space complexity is the amount of temporary 
storage required to carry out the algorithm. For example, suppose a person stood at 
a chalkboard, ready to solve a problem. We would count only the space required on
the chalkboard, not the storage for the input or output, toward the space complexity
of the problem. Usually we like to link the idea of complexity to the input size of the
problem, n. The following discussion of time complexity outlines a few tools that 
are standard in the study of algorithms. The same tools and ideas apply for space 
complexity analysis. [Ref. 35: pp. 42-43] 

The most common method for describing the time complexity of an algorithm
is the “big-Oh” notation [Ref. 35: p. 39]. A function $g(n)$ is $O(f(n))$, read “order $f(n)$,” if there exist
constants $c$ and $N$ so that, for all $n \ge N$, $g(n) \le c f(n)$.

$$ g(n) = O(f(n)) \iff g(n) \le c f(n), \quad n \ge N \tag{A.28} $$

This means that for a large enough problem size $n$, the execution time $g(n)$ is at most a
constant multiple of some function, $f(n)$. Big-Oh notation does not mean a least
upper bound, only an upper bound for $n$ sufficiently large. Practically, $O(f(n))$ must
be augmented so that we may determine how tightly $c f(n)$ bounds $g(n)$.
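
As a small worked instance of the big-Oh definition, with a function chosen here purely for illustration (it does not come from the thesis or its references), take $g(n) = 3n^2 + 5n$ and $f(n) = n^2$:

```latex
\[
g(n) = 3n^2 + 5n \;\le\; 3n^2 + n \cdot n = 4n^2
\qquad \text{for all } n \ge 5,
\]
```

so the constants $c = 4$ and $N = 5$ show that $g(n) = O(n^2)$. The same $g(n)$ is also $O(n^3)$, which illustrates why big-Oh by itself gives an upper bound but not necessarily a tight one.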

By adding a lower bound to big-Oh, we may arrive at a more informative
statement concerning an algorithm’s complexity. This is achieved through the use of
“big Omega”. $T(n) = \Omega(g(n))$ means that there exist constants $c$ and $N$ such that,
for all $n \ge N$, the number of steps $T(n)$ required to solve the problem for input size
$n$ is at least $c g(n)$.

$$ T(n) = \Omega(g(n)) \iff T(n) \ge c g(n), \quad n \ge N \tag{A.29} $$

This is essentially a lower bound on time complexity. If a function $f(n)$ satisfies
both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$ (not necessarily using the same constants
$c$ and $N$ for both $O$ and $\Omega$) then we say that $f(n) = \Theta(g(n))$. [Ref. 35: p. 41]

$$ f(n) = O(g(n)) \text{ and } f(n) = \Omega(g(n)) \iff f(n) = \Theta(g(n)), \quad n \ge N \tag{A.30} $$

Now and then, notation similar to $O$ and $\Omega$ is required except that a strict inequality
is desired. In this case, we use “little oh” and “little omega”. The definitions are:

$$ f(n) = o(g(n)) \iff \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0 \iff g(n) = \omega(f(n)) \tag{A.31} $$

We have seen that $O$, $\Omega$, $\Theta$, $o$, and $\omega$ are roughly equivalent to the inequalities
$\le$, $\ge$, $=$, $<$, and $>$, respectively. Is this notation meaningful? Does it have utility in
problem solving? The answer is a guarded “yes.” We must understand the purpose
of the notation. It cannot substitute for timing data taken from the actual execution
of an algorithm. It is intended as a good first estimate. There are too many variables
involved in modern tools and machinery to expect accurate analysis from other than
actual execution.

TABLE A.1: ALGORITHM COMPLEXITY AND MACHINE SPEED

Algorithm     Execution Time (in Seconds) for Machine Speed
Complexity    1000 steps/sec   2000 steps/sec   4000 steps/sec   8000 steps/sec
log2 n        0.01             0.005            0.003            0.001
n             1                0.5              0.25             0.125
n log2 n      10               5                2.5              1.25
n^1.5         32               16               8                4
n^2           1,000            500              250              125
n^3           1,000,000        500,000          250,000          125,000
1.1^n         10^39            10^39            10^38            10^38


Nevertheless, a rough estimate of how a problem grows is important to the prob- 
lem solving process. Indeed, experimental results and complexity analysis should not 
usually be considered independently, but compared and used as complementary in- 
struments. The time complexity of an algorithm is, in a sense, more important than 
the speed of the machine upon which it is executed. Consider the data in Table A.1
(adapted from [Ref. 35: p. 41]). This is based upon a problem of size n = 1000 and
demonstrates the ability of an algorithm to dominate a machine. For this reason, 
and with these conditions clearly established, we will find many occasions to use 
time- and space-complexity notation. 

Finally, the two most common performance measures for parallel computing
are speedup and efficiency. Suppose that $T_n$ is the time of execution for a particular
algorithm, $A$, on $n$ processors. Consider the best uniprocessor time $T_1$ for a sequential
version of $A$ compared to the execution of an equivalent (not necessarily the same)
parallel program on $P$ processors that executes in time $T_P$. Then speedup, $S_P$, is
defined as

$$ S_P = \frac{T_1}{T_P} $$

and the efficiency, $E_P$, is defined to be

$$ E_P = \frac{S_P}{P} $$
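
Both measures follow directly from measured run times. A minimal C sketch, with hypothetical function names and made-up timings used only as an example:

```c
/* Speedup S_P = T_1 / T_P, where t1 is the best uniprocessor time and
   tp is the time of the parallel program on P processors. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency E_P = S_P / P. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / (double)p;
}
```

If a sequential program took 120 seconds and its parallel counterpart took 10 seconds on 16 processors, the speedup would be 12 and the efficiency 0.75.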
APPENDIX B



EQUIPMENT 



A transputer is a microcomputer with its own local memory and with links 
for connecting one transputer to another transputer. 

The transputer architecture defines a family of programmable VLSI components.
The definition of the architecture falls naturally into the logical aspects
which define how a system of interconnected transputers is designed and programmed,
and the physical aspects which define how transputers, as VLSI components,
are interconnected and controlled.

A typical member of the transputer product family is a single chip containing
processor, memory, and communication links which provide point to point connection
between transputers. In addition, each transputer product contains special
circuitry and interfaces adapting it to a particular use. For example, a peripheral
control transputer, such as a graphics or disk controller, has interfaces tailored to
the requirements of a specific device.

A transputer can be used in a single processor system or in networks to build 
high performance concurrent systems. A network of transputers and peripheral 
controllers is easily constructed using point-to-point communication. 

— INMOS 

This introduction is provided by the transputer’s maker in [Ref. 36: p. 7]. 

A. TRANSPUTER MODULES 



INMOS makes a wide variety of microprocessors to suit differing needs. To 
provide a simple, modular interface they have developed the notion of a transputer 
module (TRAM). The TRAM is a small board containing the microprocessor, RAM, 
other circuitry, and a standard sixteen signal interface. 

B. THE IMS B012 

Most of the later experiments were carried out on an IMS B012 board. This
board accommodates sixteen transputers, each of which is installed on its own IMS
B401 TRAM. In our case the TRAM holds 32 kilobytes of memory (in addition to
the four kilobytes onboard the T800-20 transputer).

d. INMOS Transputers

The INMOS transputer gives the system designer a tremendous amount 
of latitude. With these processors — perhaps more than with any other parallel 
architecture — one should give careful thought to the size, component processors, and 
interconnection topology as the first elements in designing a solution to a problem. 
This cannot be overemphasized. When the hardware is not “general purpose” in nature,
it must receive thoughtful consideration along the path to solving the problem.
Some of the largest applications for parallel machines — especially for transputers — 
are embedded systems. 

An embedded computer system is defined as “one that forms a part of 
a larger system whose purpose is not primarily computational.” [Ref. 37: pp. 15-16] 
To automatically accept or assume a particular machine configuration is to relinquish 
control of one of the tools available in system design. 

Transputer is the name given to the members of a family of microproces- 
sors. While INMOS is the largest producer of these processors, they have not chosen 
to protect the name transputer with any sort of trademark. The name comes from 
a combination of “transistor computer” and each transputer is essentially a com- 
puter on a chip. The chip possesses an arithmetic logic unit (ALU), memory, and a 
communication system that supports bidirectional serial communication links. Most 
of the transputers used for this research also include a 64-bit (IEEE 754 standard) 
floating-point unit (FPU). 

The transputer module (TRAM) is the most common package for transputers.
The capabilities of these modules are quite diverse, but they hold to a
standard interface design. This makes the TRAM easy to use. Systems designed
around TRAMs enjoy simple replacement of components, ease of modification, and
great scalability. Indeed, the laboratory environment in which these TRAMs were
exercised is a very dynamic one.

The PARCDS laboratory has six 80286-based IBM-compatible personal 
computers, each of which contains a transputer interface board. Five hold IMS B004 
boards and one holds a Transtech TMB08 board. The B004 boards each have two 
megabytes of memory and an IMS T414 transputer in addition to the requisite 
serial-to-parallel converter and interface circuits. The TMB08 holds four megabytes 
of memory and an IMS T800-20 transputer. These “host” machines can each be 
connected to an arbitrarily large network of transputers. 

For this purpose, we have two INMOS Transputer Evaluation Module 
(ITEM) boxes. These boxes can hold at least ten boards of the Double Eurocard size 
(approximately 22 cm x 23.5 cm). Of primary interest for this thesis was the IMS
B012 board, a motherboard capable of supporting sixteen TRAMs. For this research,
all sixteen slots were filled with a TRAM that held an IMS T800-20 transputer and 
32 kilobytes of TRAM memory (in addition to the transputer’s four kilobytes). The 
shortage of memory is probably the greatest deficiency and indicator of the outdated 
nature of these processors. TRAMs with four and eight megabytes of memory and 
IMS T805-25 transputers are currently available for less than $900.00 and $1,300.00 
respectively. 



e. Intel iPSC/2 

The iPSC/2 used for this research contained eight node processors of
the “CX” type (80386/80387 combination). Like the transputers, this machine is
somewhat dated. Today’s i860 chips have considerably more capacity.

C. SWITCHING METHODS

The iPSC/2 and transputer hardware use different switching methods. Intel
uses a circuit switching approach, whereas the INMOS approach is store-and-forward
switching. Each approach has advantages and disadvantages. The circuit switching
approach is “almost universally used for telephone networks.” [Ref. 38: p. 12] The
idea is to first define a path (close a circuit) from the source to the destination and
then use it as a dedicated line.

This requires a start-up overhead that depends entirely upon the current load 
being handled by the system. If any part of the medium (links or switches) between 
the source and destination is busy, the message will wait at the source until the 
entire path is clear. The path is determined (in the iPSC/2 case) in a deterministic 
fashion, so that a message from node i to node j will always insist on a particular 
path, even if some other communication is blocking that path. As the path becomes 
clear, switches between the source and destination are set so that a dedicated line 
will exist from source to destination. 

After the overhead of establishing (closing) the circuit has been paid, commu- 
nication proceeds at a rapid rate. The intermediate nodes along the path do not 
store the message. Instead, their switches have been set so that the message flows 
through. Intuitively, this approach should be quite effective in a network with a very 
structured interconnection topology and a relatively small number of nodes. The 
hypercube gives us this structure. Hypercubes of order three or four are probably 
small enough to avoid difficulties that might arise as many nodes contend for the 
same medium. 

The store-and-forward approach does not require the availability of the entire
path between source and destination nodes. Instead, each node along the path accepts
the entire message in turn and then forwards it to the next node in the path.
This requires the use of no more than one link at a time. For a many-node environment
(particularly if there is little structure or the potential of dynamic routing), this
approach would seem to offer some advantages over the circuit switching approach.

The routing criterion is separate from the type of switching used. Either of
the two general approaches described above can support many forms of routing.
Deterministic approaches alone include many methods. For the hypercube topology
with Gray-coded node labels, it is probably useful to combine the Gray code with
the notion of Hamming distance to arrive at a shortest path route. Even with this
approach, there are as many node-disjoint optimum paths between two nodes $i$ and $j$ as the
Hamming distance, $H(i,j)$, between them [Ref. 39: p. 7]. If a dynamic scheme
is used to determine the path, there are even more combinations of potential paths
from $i$ to $j$. Usually a dynamic approach considers media utilization, “hot spot”
avoidance, and so on.
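
One deterministic shortest-path scheme for a hypercube is dimension-order routing (often called e-cube routing): the differing label bits are corrected one at a time, from the lowest bit upward. The C sketch below is an illustration under the assumption of plain binary node labels. It is not offered as a description of the router in either machine, and the function name is hypothetical.

```c
#include <stddef.h>

/* Dimension-order route from src to dst in a hypercube.
   Stores every node visited, starting with src, in path[]; path must
   have room for (number of label bits + 1) entries.
   Returns the hop count, which equals the Hamming distance. */
size_t ecube_route(unsigned src, unsigned dst, unsigned *path)
{
    size_t hops = 0;
    unsigned cur = src;
    unsigned diff = src ^ dst;         /* 1 bits mark dimensions to cross */

    path[0] = cur;
    for (unsigned bit = 1u; diff != 0u; bit <<= 1) {
        if (diff & bit) {
            cur ^= bit;                /* move along this dimension */
            diff &= ~bit;
            path[++hops] = cur;
        }
    }
    return hops;
}
```

Routing from node 000 to node 101 visits 000, 001, 101, which is two hops. A message between the same pair of nodes always takes the same path, which is the deterministic behavior described for the circuit-switched case above.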
APPENDIX C



INTERCONNECTION TOPOLOGIES 



Multiprocessor computing brings with it a fundamental concern: interproces- 
sor communication. Communication is — to any designer of computing machinery 
or software — a burden and hindrance. An interconnection topology describes the 
network that handles this load. The hypercube is one of the many topologies used 
in multiprocessor computing. It has been the subject of both hype and criticism. 
Nevertheless, this particular scheme possesses the qualities that quickly draw the 
attention of mathematicians and parallel programmers. The hypercube’s structure
and simplicity make it dependable and predictable. The same properties that
enable the hypercube to endure the rigor of mathematical proof lead to practical
solutions in parallel programming. This discussion describes the hypercube
topology and explores some of the qualities that make it a practical choice for
multiprocessor computing.

A. A FAMILIAR SETTING 



Organizing processors into a suitable topology is analogous to the familiar prob- 
lem of organizing personnel into groups. An independent worker has limited capacity, 
so we often set more hands (or machinery) to the task for productivity’s sake. Groups 
of people are often less efficient. Efficiency is a ratio of time spent doing useful work 
to the total time spent. Other metrics might work, but time is universally recog- 
nized as the standard against which productivity is measured. Dependence upon 
others requires communication and consumes time. The loss may be mini- 
mized, but not avoided. Any group working toward a common goal must deal with 
this problem. To be efficient, an organization must possess structure and media for 
communication. 

People spend time on meetings, paperwork, and peripheral pursuits, all for
the sake of an organization that hopes to outperform the individual. Organizations
typically perform tasks that are simply impossible for an individual. To be sure, an
individual often possesses the independence and efficiency that make him the proper
choice. There are tasks that seem to fit one or the other and, while there is some
crossover in ability, we aren’t likely to get rid of either organizations or individual
workers soon! This is worth considerable attention. Individuals and organizations
are chosen for different tasks.

These ideas apply in the world of parallel processing. First, there are many 
tasks. Some fit nicely onto a single processor. Others beg a parallel solution. Finally, 
some have natural solutions by either method. Even when one of these options is 
selected, there are many ways to solve the problem. If a multiprocessor is used to 
solve the problem, the issue of communications will be unavoidable. 

An interconnection topology must carry the burden of interprocessor communi- 
cations. There are many schemes for handling this mission. This discussion focuses 
on one design that fulfills that mission: the hypercube. To forestall confusion: the 
subject is an interconnection topology, not a particular vendor’s product. 

B. APPEAL TO INTUITION 

Productivity can suffer when the members of an organization communicate 
excessively. A lack of communication can also reduce efficiency. In a network of 
processors, lines of communication (links) are literal. The system will not be flexible 
if there is a shortage of links, but with too many links a message could get delayed 
or lost in the confusion. The hypercube attempts to strike a balance. 

Hypercubes come in different sizes. In fact, scalability is a key characteristic of 
the hypercube. It allows the designer to tailor a network to a problem. There are 
several ways to express the cube’s size: order is one measure. The term “hypercube 
of order n” (usually called an n-cube) is filled with meaning. A more detailed description
is given later, but pictures provide the most direct introduction. Figure C.1
shows hypercubes of order $n$ where $n \in \{0, 1, 2, 3\}$.



Figure C.1: The Four Smallest Hypercubes



This illustration is important. The hypercube shows geometry, structure, and 
symmetry. A few observations nearly jump out of the pictures. One can see several 
terms of a geometric series developing. There is also a recurrence relation at work 
in the building of hypercubes. Intuition suggests the use of well-oiled mathematical 
tools to analyze the hypercube. 



C. TOOLS 



Many benefits may be derived from a few definitions, conventions, and tools 
(that suit the hypercube’s structure). Figure C.2 demonstrates the utility of Carte- 
sian coordinates in n-dimensional space. 

Figure C.2: Cartesian Coordinates for a 3-Cube

The picture is deceptively simple, but worth careful study. Figure C.2 shows a
unit cube in three dimensions. The vertex labels express ($xyz$) position in the coordinate
system. The labels also form a binary (Gray) code that is equivalent
to coordinate labeling of a cube in n-dimensional space. The issue of communications
prompted this discussion, so distance must be addressed. A comparison of the
binary labels of any two nodes reveals that the distance between the nodes is equal to
the number of bits that differ in the labels. This measure, called Hamming distance,
and the Gray code are presented in more detail later.

This brief introduction is just enough to embark upon a more precise descrip- 
tion of the hypercube. The ideas of a coordinate system, node labeling, and distance 
are fundamental. Graph theory also finds application in topology design. In the hy- 
percube these four tools complement each other nicely. Despite their simplicity they 
can be explored in almost endless detail, even within the constraints of hypercube 
structure. 
D. DESCRIBING THE HYPERCUBE



The hypercube interconnection topology cannot be captured in a one-sentence 
definition. A definition is often inappropriate for material objects. A description 
given from several perspectives may be more useful. This is the case with topologies. 
Each tool introduced above has its own utility. In a sense, each takes up a particular 
perspective. A meaningful characterization of the hypercube can be achieved by 
combining these perspectives. 

The geometric view is most useful for visualizing the cubes. Despite its ten- 
dency to break down (with three-dimensional limitations), geometry's intuitive ap- 
peal is indispensable. Geometry and pictures lay the foundation for the setting of 
an undirected graph. Figures C.l and C.2 take advantage of geometry, but three- 
dimensional sketches begin to lose their appeal as order increases. Nevertheless, 
geometry and visual models hold an important place in describing the hypercube. 
They furnish us with (a) examples for comparison, and (b) expectations that are 
useful in the transition to a more general description of the topology. 

A hypercube of order $n$ may be described as a set of $2^n$ points (vertices, nodes,
or processors) connected by a set of edges. The points are each given an n-bit
binary label, $b_n \ldots b_3 b_2 b_1$. Thus the hypercube’s node labels exhaust all possible n-bit
binary combinations. Furthermore, the labeling convention used in Figure C.2
describes the point’s n-dimensional Cartesian coordinates.

The hypercube edge set (communication links) includes an edge between every
pair of points $p_i$ and $p_j$ whose binary labels differ in exactly one bit position, say $b_k$.
That is, adjacent nodes have a Hamming distance of one. This measure of distance
proves especially convenient in the hypercube, and it can be thought of in several
equivalent ways. A first definition of Hamming distance is the number of bits that
differ in the two labels. Equivalently, it is the number of 1’s in a bitwise exclusive
or (XOR) of the numbers. Figure C.2 contains an example. Let $p_i$ be the point
labeled 100 and $p_j$ be 110. The binary labels differ in exactly one bit position,
namely $b_2$ (the second bit). The points are neighbors (one hop from each other in
communications terms). [Ref. 40]
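
The XOR formulation translates directly into code. The C sketch below counts the 1 bits in the exclusive or of two labels; the function names are hypothetical and the labels are assumed to be stored as unsigned integers.

```c
/* Hamming distance: number of bit positions in which a and b differ. */
unsigned hamming_distance(unsigned a, unsigned b)
{
    unsigned x = a ^ b;      /* 1 bits mark the differing positions */
    unsigned count = 0;

    while (x != 0u) {
        x &= x - 1u;         /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Two hypercube nodes are adjacent when their labels differ in one bit. */
int are_neighbors(unsigned a, unsigned b)
{
    return hamming_distance(a, b) == 1u;
}
```

For the example in the text, labels 100 and 110 are the integers 4 and 6, so `hamming_distance(4, 6)` is 1 and the two nodes are neighbors.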

Despite the appeal of the geometric approach, it holds limited value in a gen- 
eral n-dimensional space. Consider n = 4 in three dimensions. Typical illustrations 
show the sixteen-node cube as a cube inside a cube with connections between corre- 
sponding nodes of the inner and outer cubes. An equivalent diagram would display 
two 3-cubes side-by-side with connections to corresponding nodes. Nevertheless, it 
seems that an n-dimensional coordinate system is the most convenient environment 
for sketching the hypercube of order n. 

E. GREATER DIMENSIONS 

Three-dimensional sketches become difficult to manage. The time comes for a 
change of method. Some of the finest tools available for spanning such a gap are 
recurrence relations and the principle of mathematical induction. The approach is 
not extremely formal, but those so inclined will not find it hard to add the formalities. 

Induction can be used to generate a Gray code suitable for labeling the nodes 
of a hypercube. This code and the Hamming distance can be used to determine 
the cube. The first topic is a procedural description of how to build hypercubes. A 
Gray code construction procedure will follow. If the two topics appear similar, it is 
because they are completely equivalent (assuming that the Gray code is combined 
with the concept of Hamming distance). 

Constructing a hypercube of order zero is trivial. This is not important except 
that it leads to greater things (i.e., it is the basis for induction). Second, suppose 
that this hypothesis for induction is true: “we know how to construct any hypercube 
of order k where 0 < k < n”. Induction forms a hypercube of order n using this 
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base case and hypothesis. This can be done in three steps: 

• Replicate the hypercube of order (n - 1) so that there are two identical copies.
For concreteness, one will be copy number 0 and the other will be copy number
1. The hypercubes have $2^{n-1}$ nodes each.

• Prepend the copy number to the existing node labels. That is, place a leading 0
in front of the labels for each node of copy 0 and place a 1 in front of every node
label in copy 1. Now every node in one copy has a corresponding node in the
other copy. These corresponding nodes are separated by a Hamming distance
of one. That is, the last (n - 1) bits are the same for corresponding nodes and
they differ only in the prepended copy number.

• Connect all nodes whose labels differ only in the prepended copy number. This
adds $2^{n-1}$ edges between the two copies.
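The three steps can be sketched in C. This is only an illustration under stated assumptions: node labels are held as integers whose binary digits are the label bits, and the adjacency matrix adj and the names build_hypercube and MAX_ORDER are hypothetical (they do not appear in the thesis programs).

```c
#include <assert.h>
#include <string.h>

#define MAX_ORDER 5                    /* illustrative limit (assumption) */
#define MAX_NODES (1 << MAX_ORDER)

/* adj[a][b] is 1 when nodes a and b share an edge (hypothetical layout). */
unsigned char adj[MAX_NODES][MAX_NODES];

/* Build the hypercube of order n by induction and return its edge count. */
long build_hypercube(int n)
{
    long edges = 0;
    int k, a, b, half;

    assert(n >= 0 && n <= MAX_ORDER);
    memset(adj, 0, sizeof adj);        /* order 0: one node, no edges */

    for (k = 1; k <= n; k++) {
        half = 1 << (k - 1);           /* nodes in each copy of order k-1 */

        /* Steps 1 and 2: copy 1 is copy 0 with a leading 1 prepended,
         * so node a of copy 0 corresponds to node (half | a) of copy 1. */
        for (a = 0; a < half; a++)
            for (b = 0; b < half; b++)
                adj[half | a][half | b] = adj[a][b];
        edges *= 2;

        /* Step 3: connect the corresponding nodes of the two copies. */
        for (a = 0; a < half; a++) {
            adj[a][half | a] = 1;
            adj[half | a][a] = 1;
        }
        edges += half;
    }
    return edges;
}
```

Each pass of the outer loop is one application of the inductive step. A call such as build_hypercube(3) leaves the adjacency matrix of the 3-cube in adj and returns its edge count.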

F. GRAY CODE GENERATION 

The procedure above generates hypercubes. By focusing on the vertex labels, 
Gray code generation can be discussed. A Gray code is a cyclic list of all of the n-bit 
numbers which changes in only one bit from one number to the next [Ref. 40]. Since 
the code is binary, there are $2^n$ numbers in the list. The starting point is arbitrary
(it is cyclic) but I have started with zero. Perhaps the best explanation of Gray 
codes comes in the construction of one. As in the construction of hypercubes, a base 
case is required to begin generation. 

TABLE C.1: GRAY CODE GENERATION

• Start with 0. This is a one-bit number (n = 1) so the one-bit Gray code must
have a total of $2^1 = 2$ numbers. The other is 1. Next, the hypercube building
steps established above are applied with slight modification.

• Given the one-bit case, it is easy to generate the n = 2 code. Write down the
previous code and draw a line below it. Next, form a copy by reflecting the code
downward across the line. Place a zero in front of each number in the previous
code (above the line), and a one in front of each number in the new copy (below
the line).

• This is a Gray code for n = 2. Table C.1 extends the idea. The list is cyclic,
each number consists of n bits, and the list contains all $2^n$ possible numbers. To
construct the code for larger n, the process may be applied repetitively. Copy
by reflecting the (n - 1)-bit code downward across a line, prepend a zero to
everything above the (most recent) line, and prepend a one to those below that
line.
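The reflect-and-prepend procedure can be written as a short C routine. This is a sketch, not the thesis program: the name reflect_gray and the choice to store each code word as an integer are assumptions made for illustration. It starts from the single word 0, so the one-bit code 0, 1 is produced by the first pass.

```c
#include <assert.h>

/* Fill code[0 .. 2^n - 1] with the n-bit reflected Gray code.
 * The caller must supply room for 2^n entries. */
void reflect_gray(int n, unsigned long *code)
{
    unsigned long len = 1;   /* number of words written so far */
    unsigned long i;
    int bits;

    assert(n >= 1 && n < 31);
    code[0] = 0;

    for (bits = 1; bits <= n; bits++) {
        /* Reflect the existing list downward across the line and
         * prepend a one (the bit of weight len) to each reflected word. */
        for (i = 0; i < len; i++)
            code[2 * len - 1 - i] = code[i] | len;

        /* Words above the line keep an implicit leading zero. */
        len *= 2;
    }
}
```

Because word i of the new half is the mirror image of word (len - 1 - i) of the old half, the two words that meet at the line differ only in the prepended bit, which is what keeps the list a Gray code.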



The Gray code is probably the most useful node labeling to attach to the hyper-
cube. This code often appears in implementation. The program listing gray.c in
Section H shows one way to generate the code. It can be used, for instance, as the
backbone of a routing function in a network. Labels with a Hamming distance of one
mark neighbors in the hypercube. What about the labels of two nodes that differ
in exactly k bits (i.e., have a Hamming distance of k)? It turns out that k is the
distance (number of edges) between these nodes. For all communications between
these nodes, the shortest path will involve k hops.

This also indicates that, for an n-cube, there is no pair of nodes that have
a Hamming distance of more than n (e.g., communication between nodes 0000010
and 1111101 in a 7-cube can be achieved in seven hops). The greatest distance 
across the n-cube is n hops. In fact, for each node in a hypercube, there is a unique 
corresponding node at a Hamming distance of n. Also, there are n nodes at a 
Hamming distance of one from each of the hypercube’s nodes. 
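These label properties are easy to check in code. The sketch below is illustrative only: the function names are hypothetical, labels are assumed to be stored as unsigned integers, and next_hop uses a simple "fix the lowest differing bit" rule, which is one possible shortest-path choice and is not taken from the thesis.

```c
#include <assert.h>

/* Number of bit positions in which labels a and b differ. */
int hamming(unsigned long a, unsigned long b)
{
    unsigned long x = a ^ b;
    int count = 0;

    while (x != 0) {
        x &= x - 1;              /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* One hop along a shortest path from cur toward dst: flip the lowest
 * bit in which the two labels differ. Returns cur if cur == dst. */
unsigned long next_hop(unsigned long cur, unsigned long dst)
{
    unsigned long diff = cur ^ dst;

    return cur ^ (diff & (~diff + 1));   /* diff & -diff isolates one bit */
}

/* The unique node at Hamming distance n from v in an n-cube:
 * the label with every one of its n bits complemented. */
unsigned long antipode(unsigned long v, int n)
{
    assert(n >= 1 && n < 32);
    return v ^ ((1UL << n) - 1);
}
```

Applying next_hop repeatedly from 0000010 toward 1111101 flips one differing bit per hop, so the walk ends after exactly hamming(0000010, 1111101) hops.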

Two approaches have been considered so far: sketching cubes in n-dimensional 
Cartesian coordinates and studying the labels associated with the cubes. Though 
the approaches are fundamentally different, they arrived at many of the same conclu- 
sions. Careful application of the Gray code and Hamming distance could produce a 
nearly endless string of results, but it is more convenient to introduce some material 
from the study of graphs at this point. Graph theory combines the two approaches: 
it looks at the pictures and studies the numbers as well. The small hypercubes 
described with earlier methods are given graph representation in the illustration of 
Figure C.3. 

G. GRAPHS OF HYPERCUBES 

Graph theory is, of course, much more sophisticated than the small subset 
used here. Buckley and Harary provide a valuable source [Ref. 41]. This discussion 
exposes a few salient features of the hypercube from the perspective of graphs. 

Figure C.3: Hypercube Graphs

A graph, H, consists of a vertex set, V(H), and an edge set, E(H). The vertices,
or nodes, in the multiprocessor network model are the processors. The edges are the
links that connect the processors. I will avoid using the term order in its graph
theory sense (i.e., number of nodes) so that it cannot be confused with the order of
the hypercube. Consider the graph, $H_n$, of a hypercube of order n. The graph has
these characteristics:

• There are $2^n$ nodes. This means that the number of nodes (i.e., processors)
grows very quickly with order.

• Every vertex, v, in $H_n$ has eccentricity e(v) = n. Eccentricity is the distance
to a node farthest from v. Additionally, each node in a hypercube has exactly
one eccentric (farthest) node. This property means that hypercubes are unique
eccentric node (u.e.n.) graphs.

• The radius of a graph is the minimum eccentricity of the nodes and diameter is
the maximum eccentricity. The hypercube is self-centered, meaning its radius
and diameter are the same: $r(H_n) = d(H_n) = n$. This is significant because it
says that worst-case communications distances only grow like the order of the
hypercube.

• Connectivity is a measure of reliability or fault tolerance in multiprocessor net-
works. The connectivity of a hypercube is equal to the order of the cube, n.
The edge connectivity is also n (each node has n incident edges).
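The eccentricity claim can be checked numerically with a breadth-first search. The routine below is a sketch under stated assumptions: nodes are numbered by their binary labels, neighbors are found by flipping one bit at a time, and the names eccentricity and MAX_ORDER are hypothetical.

```c
#include <assert.h>

#define MAX_ORDER 10                  /* illustrative limit (assumption) */
#define MAX_NODES (1 << MAX_ORDER)

/* Eccentricity of node v in the hypercube of order n, found by a
 * breadth-first search over the labels. */
int eccentricity(int n, int v)
{
    static int dist[MAX_NODES], queue[MAX_NODES];
    int nodes, head = 0, tail = 0, ecc = 0;
    int i, bit, u, w;

    assert(n >= 0 && n <= MAX_ORDER);
    nodes = 1 << n;
    assert(v >= 0 && v < nodes);

    for (i = 0; i < nodes; i++) dist[i] = -1;   /* -1 marks "not visited" */
    dist[v] = 0;
    queue[tail++] = v;

    while (head < tail) {
        u = queue[head++];
        if (dist[u] > ecc) ecc = dist[u];

        for (bit = 0; bit < n; bit++) {
            w = u ^ (1 << bit);                 /* neighbor across one edge */
            if (dist[w] < 0) {
                dist[w] = dist[u] + 1;
                queue[tail++] = w;
            }
        }
    }
    return ecc;
}
```

Since every node reports the same eccentricity, the minimum and maximum over all nodes coincide, which is the self-centered property stated in the list.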

Counting the number of nodes in a hypercube is easy. The hypercube construc- 
tion process also points to a recurrence relation that reveals the number of edges 
in a hypercube. The initial case, of course, is the hypercube of order zero with no 
edges. After this, the number of edges can be expressed in terms of the size of the 
previous cube. Suppose a hypercube of order n has q edges. Then the hypercube of
order (n + 1) will have $2q + 2^n$ edges. This is because the construction procedure
calls for two copies and $2^n$ edges between them.

Figure C.4 provides an example. This is the graph, $H_4$, of the hypercube of
order four. All of the characteristics given above are evident. Additionally, a Gray 
code labeling of the nodes is given. The recurrence relation above is useful, but it 
retains a dependence upon q. A more convenient formula would depend on n alone. 

In fact, there is a simple formula for the number of edges in the graph of a 
hypercube, but it requires a closer look at the recurrence relation. In more formal 
terms: let q(n) represent the number of edges in a hypercube of order n. Then: 

$$q(n) = \begin{cases} 0 & \text{if } n = 0 \\ 2\,q(n-1) + 2^{n-1} & \text{if } n \ge 1 \end{cases}$$

This can be expanded and shown equivalent to: $q(n) = n\,2^{n-1}$. Table C.2
provides an example.
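One way to carry out the expansion is sketched here. The auxiliary quantity r(n) is introduced only for this derivation and is not part of the original notation.

```latex
% Divide the recurrence q(n) = 2 q(n-1) + 2^{n-1} by 2^n:
\frac{q(n)}{2^{n}} = \frac{q(n-1)}{2^{n-1}} + \frac{1}{2}
% Let r(n) = q(n) / 2^n, so that r(0) = 0 and r(n) = r(n-1) + 1/2.
% Adding 1/2 a total of n times gives
r(n) = \frac{n}{2}
\quad\Longrightarrow\quad
q(n) = 2^{n}\, r(n) = n\, 2^{n-1}
% Check against the recurrence for n = 4: q(4) = 2(12) + 2^3 = 32 = 4 \cdot 2^{3}.
```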
TABLE C.2: NODES AND EDGES FOR A HYPERCUBE

Order      Number of Nodes      Number of Edges
0          1                    0
1          $2^1 = 2$            $2(0) + 2^0 = 1$
2          $2^2 = 4$            $2(1) + 2^1 = 4$
3          $2^3 = 8$            $2(4) + 2^2 = 12$
4          $2^4 = 16$           $2(12) + 2^3 = 32$
5          $2^5 = 32$           $2(32) + 2^4 = 80$
6          $2^6 = 64$           $2(80) + 2^5 = 192$
7          $2^7 = 128$          $2(192) + 2^6 = 448$
(n - 1)    $2^{n-1}$            $q$
n          $2^n$                $2q + 2^{n-1}$

Figure C.4: Graph of a 4-Cube



H. SOURCE CODE LISTINGS 



A listing of the Gray code generation program gray.c follows.

gray.c

/*
 * ========== PROGRAM INFORMATION ==========
 *
 * SOURCE      gray.c
 * VERSION     1.2
 * DATE        01 August 1991
 * AUTHOR      Jon Hartman, U. S. Naval Postgraduate School
 * USAGE       gray
 *
 * REFERENCES
 *
 * [1] Hamming, Richard W. "Coding and Information Theory", 2nd edition,
 *     Englewood Cliffs, N.J.: Prentice-Hall, 1986, pp. 97-99.
 */

/*
 * ============== DESCRIPTION ==============
 *
 * This program generates and displays the Gray code described in [1].
 *
 * =============== ALGORITHM ===============
 *
 * Consider a b-bit Gray code beginning at zero.  Let j be an integral index
 * such that 0 <= j < b.  Consider two b-vectors, mod_counter[] and bin[].
 * Each element, mod_counter[j], holds a count mod (2^(j+1)).  Initially we
 * shall set mod_counter[j] = (2^j).  Furthermore, let the elements of bin[]
 * represent a binary number in the natural way.  That is, each element,
 * bin[j], will be either 0 or 1, and bin[] will be formed so that the sum,
 * ( 2^0 * bin[0] + 2^1 * bin[1] + 2^2 * bin[2] + ... ), represents the
 * 'value' of bin[].  We have elected to start the code at zero, so let
 * bin[] be set to zeros initially.  Next perform this algorithm:
 *
 *     for (i = 0; i < (2^b); i++) {
 *         Print the "binary number" represented by bin[].
 *         for (j = 0; j < b; j++) {
 *             Let mod_counter[j] = (mod_counter[j] + 1) mod (2^(j+1))
 *             If mod_counter[j] == 0, then toggle the bit in bin[j]
 *             (i.e., bin[j] = (bin[j] XOR 1) ).
 *         } end for(j)
 *     } end for(i)
 */

#include <stdio.h>
#include <stdlib.h>             /* calloc(), free(), exit() */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE 1
#endif

#ifndef SUCCESS
#define SUCCESS 0
#endif

/* A long constant keeps the shift as wide as the long operands used below. */
#define POW2(n) ((1L) << (n))

/* Discard the rest of the current input line (fflush(stdin) is undefined). */
static void skip_line(void)
{
    int c;

    while ((c = getchar()) != '\n' && c != EOF)
        ;
}

int main(void)
{
    int  patience = 5;          /* there's a limit to my patience!  */
    long b = 0,                 /* as in b-bit Gray code            */
         *bin,                  /* as described above               */
         i,                     /* generic integral values          */
         j,
         l = 0,                 /* length of Gray code (2^b)        */
         *mod_counter;          /* as described above               */
    long max_bits = (long) (sizeof(long) * 8 - 2);  /* largest safe b */

    printf("\n\n\n\n\n\n ==== ");
    printf("This program generates the binary numbers of a Gray code. ");
    printf("==== \n\n\n");

    printf(" Successive numbers in a Gray code differ in exactly ");
    printf("one bit position.\n");

    printf(" The list generated by this program will be complete. ");
    printf("That is, if you\n");

    printf(" request the code of numbers that are b-bits long, ");
    printf("you will get a list\n");

    printf(" of (2^b) binary numbers, starting with zero.\n\n\n");

    /* The sole purpose of this while() loop is to get the value of b */
    while (b <= 0) {

        printf(" Please enter desired length (binary digits): ");
        if (scanf("%ld", &b) != 1) b = 0;   /* %ld matches the long b */
        skip_line();
        printf("\n\n");

        if (b > max_bits) {     /* guard against too many left shifts! */
            printf(" The acceptable range is ");
            printf(" 1..%ld. ", max_bits);
            printf("Please try again.\n\n\n");
            b = -1;
        }

        if (b > 0) {            /* else ask again (patience permitting) */
            l = POW2(b);
        } else if (--patience <= 0) {
            printf(" Ran out of patience!\n");
            exit(EXIT_FAILURE);
        }
    }   /* end while (b <= 0) */

    /* Allocate storage for the arrays, test to see if it worked */
    bin         = (long *) calloc((size_t) b, sizeof(long));
    mod_counter = (long *) calloc((size_t) b, sizeof(long));

    if ((!bin) || (!mod_counter)) {
        printf("main(): Allocation failure bin[] or mod_counter[].\n");
        exit(EXIT_FAILURE);
    }

    /* Initialize mod_counter[] */
    for (i = 0; i < b; i++) mod_counter[i] = POW2(i);

    printf(" Gray code for %ld bits will generate ", b);
    printf("%ld numbers.\n\n\n", l);
    printf(" Press RETURN to continue....");
    skip_line();
    printf("\n\n\n");

    /* Do the for() loop spoken of in the "ALGORITHM" section above */
    for (i = 0; i < l; i++) {

        /* Print the binary representation held in bin[] */
        printf("\t");
        for (j = (b - 1); j >= 0; j--) { printf("%ld", bin[j]); }
        printf("\n");

        /* Adjust the counters using addition mod (2^(j+1)) and toggle the
         * corresponding bit in bin[] whenever an element of mod_counter[]
         * reaches zero.
         */
        for (j = 0; j < b; j++) {
            mod_counter[j]++;
            if ((mod_counter[j] %= POW2(j + 1)) == 0) bin[j] ^= 1;
        }
    }   /* end for(i) */

    free(bin);
    free(mod_counter);

    return (SUCCESS);
}
/* ============= EOF gray.c ============= */
APPENDIX D



A SPARSE MATRIX 



Partial differential equations can be used to characterize many physical prob- 
lems. Explicit solutions to these problems are often quite complicated, so alterna- 
tive approaches warrant our attention. Simple matrices exist as legitimate repre- 
sentatives of complex problems. A system of linear equations can be constructed 
to give a discrete approximation to the problem. The structure of the physical 
setting guarantees that the corresponding matrix of coefficients will be sparse and 
symmetric. Why does this happen? When do we have the right to expect such a 
simple matrix? Where does the matrix come from and what does it mean? 

This discussion explains how to construct the matrix of coefficients and vec-
tors that describe the numerical approximation to an elliptic partial differential
equation. Poisson’s equation in two dimensions is used to demonstrate the process.
The first step uses a finite difference approximation to produce a system of equa-
tions. The system is fine-tuned and the matrix of coefficients is extracted. The
process reveals the origins of structure and shows why the matrix is sparse and
symmetric.

A. LAPLACE AND POISSON 

To most engineers, mathematicians, and scientists, Laplace and Poisson are 
familiar French names. Pierre-Simon de Laplace (1749-1827) and Simeon Denis 
Poisson (1781-1840) made sizeable contributions to several fields. In a moment, the 
discussion turns to partial differential equations named in honor of these gentlemen. 

If the material seems a bit difficult, the following quote from [Ref. 42: p. 10] 
may provide some encouragement. The ideas are not so obvious to everyone as they 
may have been to Laplace. 

Nathaniel Bowditch (1773-1838), an American astronomer and mathemati-
cian, while translating Laplace’s Mécanique céleste in the early 1800s, stated, “I
never come across one of Laplace’s ‘Thus it plainly appears’ without feeling sure
that I have hours of hard work before me to fill up the chasm and find out and show
how it plainly appears.”

The next several pages are dedicated to showing how the matrix representation
of a partial differential equation plainly appears! The objective is to describe a
particular physical problem, then convert it to the equivalent matrix representation
using a deliberate, step-by-step approach.

B. EQUATIONS 

Laplace and Poisson worked with partial differential equations that can be ob- 
served in nature. What kinds of natural phenomena can be described with partial 
differential equations? This section gives a brief answer to this question. The dis- 
cussion includes the natural setting, the equations, and a quick look at the variables 
and constants involved. The link between the equations and their physical meaning 
is critical, so this aspect must be developed. The heat equation has one of the most 
intuitive physical interpretations available, so it is used as a starting point. After 
developing a general perspective, the field can be narrowed to a particular example — 
Poisson’s equation. Such a limited survey of partial differential equations can only
hope to succeed by appealing to the reader’s experience and intuition.

1. Heat 

Before looking at a partial differential equation, let us recall some plane 
geometry. The intersection of a plane and a cone(s) provides many interesting shapes 
and equations. Consider the equation that describes all points equidistant from a 
point (focus) and a line (directrix): 

$$y = \left(\frac{1}{4c}\right) x^2 + k \tag{D.1}$$

This is a parabola whose focus and vertex both lie on the y-axis (the axis of the
parabola is the y-axis). The focal length is c and the vertex is located at (0, k).

Partial differential equations are classified using conic sections much like
equations in the xy-plane. Introductions to partial differential equations often begin
with the heat equation:

$$\frac{\partial u}{\partial t} = \kappa \frac{\partial^2 u}{\partial x^2} + Q(x, t) \tag{D.2}$$

This is an example of a parabolic partial differential equation. Note the similarity of
equations (D.1) and (D.2).



a. Definitions and Notation 



The heat equation describes the temperature, u(x,t), in a “thin rod”
(the single dimension x appears in the equation). The presence of t indicates depen-
dence upon time. If there is a heat source (or sink) present, it is represented by Q.
We can see that Q may be a function of x or t or both. When mass density ($\rho$),
specific heat ($s$), and thermal conductivity ($K$) are known, the thermal diffusivity,
$\kappa$, can be determined using the following relation:

$$\kappa = \frac{K}{s\rho} \tag{D.3}$$



b. Houses and Heat 



From our youth, we have observed several important properties of heat
flow. The lessons are simple, few in number, and can be observed from the comfort
of our home. First, heat energy only flows when there is a difference in temperature.
If the temperature outside is the same as the indoor temperature, no heat energy will
cross the threshold (even with the door open). A temperature difference represents
an instability and heat will flow to counter this situation.

When heat does flow, it goes from hotter to colder regions. The loss of
heat energy from the warmer region reduces the temperature there, and the tem-
perature in the colder region rises as it gains heat energy. The transfer of heat
has a stabilizing effect (the environment will not be at rest as long as temperature
differences exist). We do not find the changes in temperature surprising, but our
conversation indicates confusion concerning the direction of the flow. Most of us have
heard someone say: “Close the door, you’re letting cold air in!”. We understand that
this statement is not correct, but it seems to persist from one generation to the next.

In addition to the idea that heat flows in the presence of temperature 
differences (gradients), we clearly understand that larger differences are related to 
greater heat flow. On a very cold winter day, the parent notices more quickly that the
child left the door open (and displays more urgency in shutting it). In other words, 
the effect of heat flow is to balance differences in temperature and it somehow “works 
harder” when there is a greater difference to balance. In mathematical terms, we 
would suspect (correctly) that heat flow is proportional to temperature difference. 

Finally, we recognize an ability to restrict heat’s ever-present balancing 
efforts. Sometimes we want an imbalance in temperature, and we often use insulation 
to maintain this imbalance. When we shut the door, we expect that it will slow 
the transfer of thermal energy through the doorway and enable us to maintain an 
acceptable imbalance in temperature. For the same reason we use special materials 
in the construction of refrigerators to keep heat out, and in ovens to keep heat energy 
inside. This means that the effectiveness of heat transfer is subject to properties of 
the medium (air, glass windows, fiberglass insulation, wood doors, steel, styrofoam, 
and so on) through which it flows. 

c. Heat Flux 

The right-hand side of the heat equation looks a bit complex, but it
merely captures this idea of heat flow. Before tackling the second partial derivative
of u with respect to x, think about the first partial derivative. The first partial
derivative of u with respect to x (scaled by the thermal conductivity, K) describes
movement of thermal energy. This flow of heat is usually called heat flux, denoted
$\phi$, and can be calculated using Fourier’s law of heat conduction:

$$\phi = -K \frac{\partial u}{\partial x} \tag{D.4}$$

Heat flux is a measure of how much thermal energy per unit time is
moving to the right per unit surface area (by convention, flow to the left is assigned
a negative value and flow to the right is positive) [Ref. 43: p. 3]. The second partial
derivative measures changes in flux with respect to position. In other words, it
represents increasing or decreasing flux.

d. Heat Equation Summary 

Let us carefully reassemble the pieces of the heat equation (D.2) to see 
if the theory agrees with experience. Temperature has spatial and temporal depen- 
dencies. The left-hand side describes changes in temperature over time. Changes in 
heat flux are captured in the second partial of u that appears on the right-hand side. 
Flux, heat energy in motion, acts to equalize temperature. The thermal diffusivity,
$\kappa$, measures how readily the material permits heat flux. That is, a temperature difference
activates the flow of heat but the speed and effectiveness of this flow is moderated by
material properties. Considering everything, then, the heat equation can be stated
in one (long) sentence: Changes in temperature over time are caused by (equal to, 
due to, related to) changes in heat flow (moderated or accelerated by properties of 
the material) and thermal source(s). 

2. Notation 

With two or more dimensions, the same equations that looked simple in one
dimension can begin to look complex. The linear operator, $\Delta$, is used to simplify
the notation. For example, $\Delta u$, substituted into the right-hand side of (D.2), gives
the heat equation a new look:

$$\frac{\partial u}{\partial t} = \kappa \Delta u + Q(x, t) \tag{D.5}$$

This is a more general equation since the linear operator $\Delta$ can be applied in any
number of dimensions. For instance (in three dimensions),

$$\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \tag{D.6}$$

Sometimes this operator is called the Laplacian of u and some authors use the del
operator, $\nabla$, in these equations ($\nabla^2 u = \Delta u$).

3. Diffusion 



The behavior of thermal energy is actually a special instance of diffusion,
so (D.5) is often referred to as the diffusion equation. With an appropriate substi-
tution for $\kappa$, the equation might describe the spreading of dye through ocean water.
In an agricultural application, it could characterize water or chemical penetration 
in soil. We shall continue to use the term “heat equation”, though, for the sake of 
consistent terminology and notation. 

4. Laplace’s Equation 



Consider the effect of a few restrictions on the heat equation. Suppose that
there is no source of thermal energy (Q = 0) and the physical properties of the
material do not vary ($\kappa$ is constant). Finally, what happens if the time-dependency
is removed?

The left-hand side of the equation goes away. This is not so unrealistic.
Systems may reach a steady (equilibrium) state after a time (especially in the absence
of sources). We can divide through by $\kappa$ (assuming $\kappa \ne 0$) and the equation becomes:

$$\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \tag{D.7}$$

This is Laplace’s equation in the two dimensions x and y. Sometimes it is called
the potential equation since it also describes the cases in which u stands for
gravity or voltage. It can also describe “steady-state heat flow. . . hydrodynamics,
gravitational attraction, elasticity, and certain motions of incompressible fluids”
[Ref. 44: pp. 660-661].

5. Ellipses 



Although Laplace’s equation seems like a steady-state heat equation, it is
fundamentally different. It falls in the elliptic class of partial differential equations.
Consider an ellipse centered at the origin with foci (on the x-axis at a distance of c
from the origin) located at (-c, 0) and (c, 0). Suppose that the foci are labeled $F_1$
and $F_2$. The major axis passes through the center and through the foci, connecting
two vertices positioned at (-a, 0) and (a, 0). The minor axis passes through the
center perpendicular to the major axis and connects the vertices at (0, -b) and
(0, b). The major axis deserves its name since $a \ge b$ (in the case of equality the
ellipse degenerates and we get a special case, the circle).

For any arbitrary point, p, let $d_1$ be the distance from p to $F_1$
and let $d_2$ be the distance from p to $F_2$. Furthermore, let $d = d_1 + d_2$. The ellipse
is described by all points satisfying d = 2a, where a is the constant length of the
ellipse’s semi-major axis as described above. The standard form for the equation of
this ellipse is

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1 \tag{D.8}$$

Using the distances from this ellipse, a right triangle can be formed with sides of
length b and c and hypotenuse of length a. This means a, b, and c are related by the
Pythagorean Theorem ($a^2 = b^2 + c^2$).

Figure D.1: The Region

6. Poisson’s Equation 

We have discussed several partial differential equations and observed the 
impact of changing a few parameters. Laplace’s equation showed what happens in 
the steady-state case when sources are removed and the thermal diffusivity is non- 
zero. Now we return to the more general problem that can be represented in the 
presence of a source, sometimes called a driving (or forcing ) function, say f(x,y). 

The result is Poisson’s equation (shown here in two dimensions):

$$\Delta u = f(x, y) \tag{D.9}$$

Again, u(x,y) typically represents temperature or voltage. Laplace’s equation (D.7)
is just the special case of Poisson’s equation (D.9) where f(x,y) = 0. The rest of
the discussion will focus on Poisson’s equation within the rectangular region (shown
in Figure D.1): $0 \le x \le L$, $0 \le y \le H$.

Figure D.2: Subdividing the Rectangle



7. Final Assumptions 

We shall assume that the conditions along the boundaries are known and are
given by u = g(x,y). The problem is solved in the presence of a forcing function f.
The goal is to produce something that a computing machine can “solve”. To reach
this position, several steps are required. First, the domain is divided into many
smaller regions. Using this subdivision scheme, a system of equations is developed.
The information that is known (f and g) can be moved to the right-hand side of the
system. The system can then be represented in typical Ax = b fashion.

C. DISCRETIZATION 

Before attempting a numerical solution, the domain must be subdivided into a 
finite (but probably large) number of elements. Figure D.2 provides an illustration 
of what this mesh looks like. We should not forget that actual applications may 
involve 100 (or more) divisions in each direction. Nevertheless, (artificially) small
examples are quite sufficient for conveying notation and measures within the region.



1. Notation 

A clear understanding of the problem domain, conventions, and notation 
is prerequisite to developing the system of equations. Consider Figure D.2. This 
domain will serve as a reference for the upcoming discussion on conventions and 
notation. 

The rectangular region has length L = 9 and height H = 5. It has been
subdivided into 45 smaller elements by a mesh made of four horizontal lines and eight
vertical lines. The integers m and n are used to keep track of how many horizontal
and vertical dividing lines are used (here m = 4 and n = 8). Each element has length
h (in the x-direction) and height k (in the y-direction). In this particular example,
the elements are (conveniently) square with h = k = 1. In general, the individual
elements within the region are rectangular (it is not necessarily true that h = k).

The elements within the region are uniformly spaced (each has the same
size). L, H, h, and k do not need to be integers; they can be any convenient units.
To guarantee uniform spacing, of course, L and H must be integer multiples of h
and k, respectively. That is:

$$L = (n + 1)h, \quad n \in \{0, 1, 2, 3, \ldots\}$$

$$H = (m + 1)k, \quad m \in \{0, 1, 2, 3, \ldots\}$$

2. Internal Mesh Points 

Our goal is a system of equations, and ultimately a problem stated in terms
of a matrix and vectors. We will eventually see that there are mn equations in mn
unknowns, one for each internal mesh point (where the lines cross). Imagine elements
of size h x k (as before) that are centered on these points, such as the cross-hatched
element at (7,3). Each equation in the system will correspond to one of these line-
crossings and represent one of these elements. It is useful to label the lines for
reference purposes. To accomplish this, we use the (integer) counters i and j.

These counters are used to reference particular vertical and horizontal di-
viding lines. The i counter refers to a vertical line ($1 \le i \le n$) and the horizontal
lines are indexed by j ($1 \le j \le m$). Figure D.2 may be deceptively simple due to
the element dimensions h = k = 1. Because of this, i = 7 indicates an x-coordinate
of 7 and j = 3 means y = 3. But the counters i and j are not generally equivalent to
x- and y-position in the coordinate system. Given h, k, i, and j the corresponding
coordinates are (x, y) = (ih, jk).

D. A SYSTEM OF EQUATIONS 

The next step is to build a system of mn equations that describes the problem.
First, we need to agree upon a referencing scheme for the internal mesh points. The
numbering will be based upon i and j as defined above. This numbering scheme
begins at the bottom left (i.e., i = j = 1), proceeds up the first column and then
moves, column-by-column, to the right. Specifically, the points will be assigned a
label

$$\ell = m(i - 1) + j \tag{D.10}$$

Given the values i and j for any internal point, now we can assign it a label
$\ell$ ($1 \le \ell \le mn$). Figure D.3 shows values of i along the x-axis, values of j
along the y-axis, and labeling of internal mesh points according to (D.10).
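The bookkeeping for the mesh can be collected in a few small C helpers. This is a sketch for illustration; the function names mesh_spacing, mesh_label, and mesh_indices are hypothetical and are not taken from the thesis programs.

```c
#include <assert.h>

/* Element size for a side of length extent cut by the given number of
 * dividing lines: h = L/(n + 1) or k = H/(m + 1). */
double mesh_spacing(double extent, long lines)
{
    assert(lines >= 0);
    return extent / (double) (lines + 1);
}

/* Label (D.10) of the internal mesh point (i, j),
 * where 1 <= i <= n and 1 <= j <= m. */
long mesh_label(long i, long j, long m, long n)
{
    assert(1 <= i && i <= n && 1 <= j && j <= m);
    return m * (i - 1) + j;
}

/* Inverse of (D.10): recover the counters i and j from a label. */
void mesh_indices(long label, long m, long *i, long *j)
{
    assert(label >= 1 && m >= 1);
    *i = (label - 1) / m + 1;    /* column (vertical line) counter   */
    *j = (label - 1) % m + 1;    /* row (horizontal line) counter    */
}
```

With the m = 4, n = 8 mesh of Figure D.2, mesh_label(7, 3, 4, 8) gives the label of the cross-hatched element's center, and the coordinates of any point follow from (x, y) = (ih, jk).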

1. Finite Differences 

Figure D.3: Numbering the Equations

The approach calls for analyzing each internal mesh point. Figure D.4
shows the point referenced by i and j and its neighbors to the North, South, East,
and West. We use a centered finite difference method to approximate the partial
derivatives in (D.9) and arrive at the equations for these points. The finite difference
approximations for the partial derivatives are:

$$\left.\frac{\partial^2 u}{\partial x^2}\right|_{(i,j)} \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} \tag{D.11}$$

$$\left.\frac{\partial^2 u}{\partial y^2}\right|_{(i,j)} \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2} \tag{D.12}$$

The approximation for the partial derivative in the x-direction (D.ll) con- 
siders the neighbor to the West, the point itself, and the neighbor to the East. 
Similarly, the approximation in the y-direction (D.12) recognizes neighbors to the 
South and North in addition to the point. Both finite difference approximations 
favor the center point giving it twice the weight of its neighbors. 
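The two centered differences are short enough to write out directly. The C sketch below is an illustration only: the function names are hypothetical, and sample_u is an arbitrary test function chosen here because its second derivatives are known exactly (2 in x and 6 in y).

```c
#include <assert.h>

/* Centered difference for the second partial in x, as in (D.11):
 * West neighbor, the point itself (weighted twice), East neighbor. */
double d2u_dx2(double (*u)(double, double), double x, double y, double h)
{
    assert(h > 0.0);
    return (u(x - h, y) - 2.0 * u(x, y) + u(x + h, y)) / (h * h);
}

/* Centered difference for the second partial in y, as in (D.12):
 * South neighbor, the point itself (weighted twice), North neighbor. */
double d2u_dy2(double (*u)(double, double), double x, double y, double k)
{
    assert(k > 0.0);
    return (u(x, y - k) - 2.0 * u(x, y) + u(x, y + k)) / (k * k);
}

/* Hypothetical test function: u = x^2 + 3y^2. */
double sample_u(double x, double y)
{
    return x * x + 3.0 * y * y;
}
```

For a quadratic such as sample_u the centered differences reproduce the second derivatives without truncation error, which makes it a convenient check; for general u the approximation improves as h and k shrink.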

Substituting these into Poisson’s equation (D.9) yields: 



$$-\left(\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2}\right) - \left(\frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{k^2}\right) \approx -f_{i,j} \tag{D.13}$$
Figure D.4: Neighbors to the North, South, East, and West

The forcing function, $f_{i,j}$, is known, so (D.13) begins to look like one of many
equations in a linear system. There is such an equation for every internal mesh point.
To make sure that we consider all of the internal mesh points in an orderly fashion, 
we may number them as in Figure D.3 and consider them one at a time. 
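As a small illustration of (D.11) through (D.13), the sketch below evaluates the left-hand side of (D.13) at one interior point. The storage layout (a full grid of (n+2) by (m+2) values, boundary included, indexed as `u[i * (m + 2) + j]`) and the name `stencil_lhs` are assumptions made for this example, not something taken from the thesis programs.

```c
#include <assert.h>

/* Left-hand side of (D.13) at interior point (i, j).
 * u holds u_{i,j} for 0 <= i <= n+1 and 0 <= j <= m+1,
 * stored so that u[i * (m + 2) + j] is u_{i,j}. */
double stencil_lhs(const double *u, int i, int j, int n, int m,
                   double h, double k)
{
    int stride = m + 2;
    assert(i >= 1 && i <= n && j >= 1 && j <= m);

    double center = u[i * stride + j];
    double west   = u[(i - 1) * stride + j];
    double east   = u[(i + 1) * stride + j];
    double south  = u[i * stride + (j - 1)];
    double north  = u[i * stride + (j + 1)];

    double d2x = (west - 2.0 * center + east) / (h * h);    /* (D.11) */
    double d2y = (south - 2.0 * center + north) / (k * k);  /* (D.12) */

    return -d2x - d2y;
}
```

For a quadratic such as u = x^2 + y^2 the centered differences are exact, so with h = k = 1 the function returns -4 at every interior point.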

2. More Equations 



At this point, we know the general form (D.13) for each of the equations 
that must be considered. The matrix of coefficients may not be completely clear yet, 
so let us consider each of the equations in the order of their labels. For now, we will 
leave the i,j subscripts on everything: 



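$$
\begin{aligned}
-\left(\frac{u_{0,1} - 2u_{1,1} + u_{2,1}}{h^2}\right) - \left(\frac{u_{1,0} - 2u_{1,1} + u_{1,2}}{k^2}\right) &\approx -f_{1,1} \\
-\left(\frac{u_{0,2} - 2u_{1,2} + u_{2,2}}{h^2}\right) - \left(\frac{u_{1,1} - 2u_{1,2} + u_{1,3}}{k^2}\right) &\approx -f_{1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{0,m-1} - 2u_{1,m-1} + u_{2,m-1}}{h^2}\right) - \left(\frac{u_{1,m-2} - 2u_{1,m-1} + u_{1,m}}{k^2}\right) &\approx -f_{1,m-1} \\
-\left(\frac{u_{0,m} - 2u_{1,m} + u_{2,m}}{h^2}\right) - \left(\frac{u_{1,m-1} - 2u_{1,m} + u_{1,m+1}}{k^2}\right) &\approx -f_{1,m} \\
-\left(\frac{u_{1,1} - 2u_{2,1} + u_{3,1}}{h^2}\right) - \left(\frac{u_{2,0} - 2u_{2,1} + u_{2,2}}{k^2}\right) &\approx -f_{2,1} \\
-\left(\frac{u_{1,2} - 2u_{2,2} + u_{3,2}}{h^2}\right) - \left(\frac{u_{2,1} - 2u_{2,2} + u_{2,3}}{k^2}\right) &\approx -f_{2,2} \\
&\;\;\vdots \\
-\left(\frac{u_{1,m-1} - 2u_{2,m-1} + u_{3,m-1}}{h^2}\right) - \left(\frac{u_{2,m-2} - 2u_{2,m-1} + u_{2,m}}{k^2}\right) &\approx -f_{2,m-1} \\
-\left(\frac{u_{1,m} - 2u_{2,m} + u_{3,m}}{h^2}\right) - \left(\frac{u_{2,m-1} - 2u_{2,m} + u_{2,m+1}}{k^2}\right) &\approx -f_{2,m} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,1} - 2u_{n-1,1} + u_{n,1}}{h^2}\right) - \left(\frac{u_{n-1,0} - 2u_{n-1,1} + u_{n-1,2}}{k^2}\right) &\approx -f_{n-1,1} \\
-\left(\frac{u_{n-2,2} - 2u_{n-1,2} + u_{n,2}}{h^2}\right) - \left(\frac{u_{n-1,1} - 2u_{n-1,2} + u_{n-1,3}}{k^2}\right) &\approx -f_{n-1,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-2,m-1} - 2u_{n-1,m-1} + u_{n,m-1}}{h^2}\right) - \left(\frac{u_{n-1,m-2} - 2u_{n-1,m-1} + u_{n-1,m}}{k^2}\right) &\approx -f_{n-1,m-1} \\
-\left(\frac{u_{n-2,m} - 2u_{n-1,m} + u_{n,m}}{h^2}\right) - \left(\frac{u_{n-1,m-1} - 2u_{n-1,m} + u_{n-1,m+1}}{k^2}\right) &\approx -f_{n-1,m} \\
-\left(\frac{u_{n-1,1} - 2u_{n,1} + u_{n+1,1}}{h^2}\right) - \left(\frac{u_{n,0} - 2u_{n,1} + u_{n,2}}{k^2}\right) &\approx -f_{n,1} \\
-\left(\frac{u_{n-1,2} - 2u_{n,2} + u_{n+1,2}}{h^2}\right) - \left(\frac{u_{n,1} - 2u_{n,2} + u_{n,3}}{k^2}\right) &\approx -f_{n,2} \\
&\;\;\vdots \\
-\left(\frac{u_{n-1,m-1} - 2u_{n,m-1} + u_{n+1,m-1}}{h^2}\right) - \left(\frac{u_{n,m-2} - 2u_{n,m-1} + u_{n,m}}{k^2}\right) &\approx -f_{n,m-1} \\
-\left(\frac{u_{n-1,m} - 2u_{n,m} + u_{n+1,m}}{h^2}\right) - \left(\frac{u_{n,m-1} - 2u_{n,m} + u_{n,m+1}}{k^2}\right) &\approx -f_{n,m}
\end{aligned}
$$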

3. Modification 



The goal is to determine $u_{i,j}$ for all internal points. Having completed
several foundational steps, we can see a developing system of mn equations. Let's
clean it up a bit. To do this, we need to make better use of one more piece of the
given information: the boundary values. For those points just inside the boundaries
(a horizontal distance of h from the sides and/or a vertical distance of k from the
top or bottom) we already know part of the left side of (D.13). In particular, any
subscript i = 0, j = 0, i = n + 1, and/or j = m + 1 signifies a (known) boundary
point.

Multiplying through by $(hk)^2$ and moving the known information to the
right-hand side of the equations, we again start with the left-most column (i = 1)
and work in the order of the labels. Now the system of equations looks like this:
$$
\begin{aligned}
k^2(2u_{1,1} - u_{2,1}) + h^2(2u_{1,1} - u_{1,2}) &\approx -(hk)^2 f_{1,1} + k^2 u_{0,1} + h^2 u_{1,0} \\
k^2(2u_{1,2} - u_{2,2}) + h^2(-u_{1,1} + 2u_{1,2} - u_{1,3}) &\approx -(hk)^2 f_{1,2} + k^2 u_{0,2} \\
&\;\;\vdots \\
k^2(2u_{1,m-1} - u_{2,m-1}) + h^2(-u_{1,m-2} + 2u_{1,m-1} - u_{1,m}) &\approx -(hk)^2 f_{1,m-1} + k^2 u_{0,m-1} \\
k^2(2u_{1,m} - u_{2,m}) + h^2(-u_{1,m-1} + 2u_{1,m}) &\approx -(hk)^2 f_{1,m} + k^2 u_{0,m} + h^2 u_{1,m+1} \\
k^2(-u_{1,1} + 2u_{2,1} - u_{3,1}) + h^2(2u_{2,1} - u_{2,2}) &\approx -(hk)^2 f_{2,1} + h^2 u_{2,0} \\
k^2(-u_{1,2} + 2u_{2,2} - u_{3,2}) + h^2(-u_{2,1} + 2u_{2,2} - u_{2,3}) &\approx -(hk)^2 f_{2,2} \\
&\;\;\vdots \\
k^2(-u_{1,m-1} + 2u_{2,m-1} - u_{3,m-1}) + h^2(-u_{2,m-2} + 2u_{2,m-1} - u_{2,m}) &\approx -(hk)^2 f_{2,m-1} \\
k^2(-u_{1,m} + 2u_{2,m} - u_{3,m}) + h^2(-u_{2,m-1} + 2u_{2,m}) &\approx -(hk)^2 f_{2,m} + h^2 u_{2,m+1} \\
&\;\;\vdots \\
k^2(-u_{n-2,1} + 2u_{n-1,1} - u_{n,1}) + h^2(2u_{n-1,1} - u_{n-1,2}) &\approx -(hk)^2 f_{n-1,1} + h^2 u_{n-1,0} \\
k^2(-u_{n-2,2} + 2u_{n-1,2} - u_{n,2}) + h^2(-u_{n-1,1} + 2u_{n-1,2} - u_{n-1,3}) &\approx -(hk)^2 f_{n-1,2} \\
&\;\;\vdots \\
k^2(-u_{n-2,m-1} + 2u_{n-1,m-1} - u_{n,m-1}) + h^2(-u_{n-1,m-2} + 2u_{n-1,m-1} - u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m-1} \\
k^2(-u_{n-2,m} + 2u_{n-1,m} - u_{n,m}) + h^2(-u_{n-1,m-1} + 2u_{n-1,m}) &\approx -(hk)^2 f_{n-1,m} + h^2 u_{n-1,m+1} \\
k^2(-u_{n-1,1} + 2u_{n,1}) + h^2(2u_{n,1} - u_{n,2}) &\approx -(hk)^2 f_{n,1} + k^2 u_{n+1,1} + h^2 u_{n,0} \\
k^2(-u_{n-1,2} + 2u_{n,2}) + h^2(-u_{n,1} + 2u_{n,2} - u_{n,3}) &\approx -(hk)^2 f_{n,2} + k^2 u_{n+1,2} \\
&\;\;\vdots \\
k^2(-u_{n-1,m-1} + 2u_{n,m-1}) + h^2(-u_{n,m-2} + 2u_{n,m-1} - u_{n,m}) &\approx -(hk)^2 f_{n,m-1} + k^2 u_{n+1,m-1} \\
k^2(-u_{n-1,m} + 2u_{n,m}) + h^2(-u_{n,m-1} + 2u_{n,m}) &\approx -(hk)^2 f_{n,m} + k^2 u_{n+1,m} + h^2 u_{n,m+1}
\end{aligned}
$$

Now the equations are very close to what we want. There are some unfor- 
tunate side effects to such a deliberate approach. The list of equations is tedious, 
the subscripts are a bit involved, and it takes some concentration to match things 
up. There are some benefits, though, for those who can endure! It will take very 
little effort to see how the coefficients are collected. 

E. MATRIX REPRESENTATION 

It is not hard to translate the preceding equations into the familiar representa- 
tion Ax = b. Notation is quite important. We will start with the obvious, exchanging 
u for x so that (eventually) the system will look like Au = b. Dimensions are important
too. The goal is a large, sparse, symmetric matrix $A \in \mathbb{R}^{mn \times mn}$. The vectors
u and b have the obvious dimensions and are assumed to contain real numbers as 
well. 
1. Unknowns

Since there is a great deal of structure in this problem, it is useful to
partition the vector of unknowns, u. Let $u_{i,j}$ have the same meaning as it did
in equation (D.13) and consider the m-vector:

$$u_i = \begin{bmatrix} u_{i,1} \\ u_{i,2} \\ \vdots \\ u_{i,m-1} \\ u_{i,m} \end{bmatrix}$$

This vector captures all of the unknowns for a given column, i, of the original region.
Now we can stack the columns, n in number, forming the entire vector u of unknowns:

$$u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{bmatrix}$$

This process has clearly formed $u \in \mathbb{R}^{mn}$. Now we turn to the matrix of coefficients.



2. Coefficients 



The matrix A is formed by combining two smaller matrices, T and D. First
we shall consider the tridiagonal matrix $T \in \mathbb{R}^{m \times m}$. For aesthetic purposes only, let
the diagonal elements of T be $d = 2(h^2 + k^2)$.

$$T = \begin{bmatrix}
d & -h^2 \\
-h^2 & d & -h^2 \\
 & -h^2 & d & -h^2 \\
 & & \ddots & \ddots & \ddots \\
 & & & -h^2 & d & -h^2 \\
 & & & & -h^2 & d
\end{bmatrix}$$

Next, consider the diagonal matrix $D \in \mathbb{R}^{m \times m}$:

$$D = \begin{bmatrix}
-k^2 \\
 & -k^2 \\
 & & \ddots \\
 & & & -k^2
\end{bmatrix}$$

Forming the matrix A requires n identical copies of T and 2(n - 1) identical
copies of D. The matrices in A below are assigned subscripts for counting purposes.
The matrix subscripts, by the way, denote a value of i corresponding to the partition
which the matrix will multiply. A is the block-tridiagonal matrix

$$A = \begin{bmatrix}
T_1 & D_2 \\
D_1 & T_2 & D_3 \\
 & D_2 & T_3 & D_4 \\
 & & \ddots & \ddots & \ddots \\
 & & & D_{n-3} & T_{n-2} & D_{n-1} \\
 & & & & D_{n-2} & T_{n-1} & D_n \\
 & & & & & D_{n-1} & T_n
\end{bmatrix}$$
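Because A is sparse and highly structured, its entries can be produced on demand rather than stored. The sketch below is one way to do that, assuming the labeling (D.10), diagonal entries d = 2(h^2 + k^2), off-diagonal entries of -h^2 inside each T block, and -k^2 on the diagonal of each D block. The function name `poisson_entry` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Entry A(p, q) of the block-tridiagonal matrix, with p and q the
 * 1-based labels from (D.10) on an n-by-m interior mesh. */
double poisson_entry(int p, int q, int n, int m, double h, double k)
{
    assert(p >= 1 && p <= m * n && q >= 1 && q <= m * n);

    /* Recover the mesh indices (i, j) behind each label. */
    int ip = (p - 1) / m + 1, jp = (p - 1) % m + 1;
    int iq = (q - 1) / m + 1, jq = (q - 1) % m + 1;

    if (p == q)
        return 2.0 * (h * h + k * k);      /* diagonal of T */
    if (ip == iq && abs(jp - jq) == 1)
        return -(h * h);                   /* off-diagonal of T */
    if (jp == jq && abs(ip - iq) == 1)
        return -(k * k);                   /* diagonal of D */
    return 0.0;                            /* everything else */
}
```

Note that labels m and m + 1 are adjacent integers but belong to different columns of the mesh (the top of one column and the bottom of the next), so the corresponding entry is zero. Comparing mesh indices rather than labels handles this case.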



3. Knowns 



We could proceed immediately to the right-hand-side vector, $b \in \mathbb{R}^{mn}$, using the
equations provided in the previous section. Again, though, the result can be cleaned
up a bit if we form b as the sum of three vectors f, v, and w.

The vector $f \in \mathbb{R}^{mn}$ represents the forcing function. The equations clearly
indicate where the scalar multiplier comes from.

$$f = -(hk)^2 \begin{bmatrix} f_{1,1} \\ f_{1,2} \\ \vdots \\ f_{1,m-1} \\ f_{1,m} \\ f_{2,1} \\ f_{2,2} \\ \vdots \\ f_{n,m} \end{bmatrix}$$
Next, the vector $v \in \mathbb{R}^{mn}$ is used to represent the information that is known
due to the boundary values on the East and West sides of the region.

$$v = k^2 \begin{bmatrix} u_{0,1} \\ u_{0,2} \\ \vdots \\ u_{0,m-1} \\ u_{0,m} \\ 0 \\ \vdots \\ 0 \\ u_{n+1,1} \\ u_{n+1,2} \\ \vdots \\ u_{n+1,m} \end{bmatrix}$$



Finally, the vector $w \in \mathbb{R}^{mn}$ is used to represent the information that is
known due to the boundary values on the North and South sides of the region.

$$w = h^2 \begin{bmatrix} u_{1,0} \\ 0 \\ \vdots \\ 0 \\ u_{1,m+1} \\ u_{2,0} \\ 0 \\ \vdots \\ 0 \\ u_{2,m+1} \\ u_{3,0} \\ \vdots \\ u_{n,m+1} \end{bmatrix}$$

Now b is a simple sum of these vectors: b = f + v + w.
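The sum b = f + v + w can also be assembled point by point, which avoids forming the three vectors separately. The following is a minimal sketch under assumed data layouts: `f_vals` holds the forcing function in label order, and `g` is a full (n+2) by (m+2) grid from which only boundary values are read. Both layouts and the name `assemble_rhs` are illustrative choices, not part of the thesis code.

```c
#include <assert.h>

/* Fill b[0..mn-1] with b = f + v + w; label l maps to b[l - 1].
 * f_vals[l - 1] : forcing function at the point with label l
 * g[i*(m+2)+j]  : grid values; only boundary entries are read */
void assemble_rhs(int n, int m, double h, double k,
                  const double *f_vals, const double *g, double *b)
{
    int stride = m + 2;
    assert(n >= 1 && m >= 1);

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int idx = m * (i - 1) + j - 1;   /* (D.10), zero-based */
            double val = -(h * k) * (h * k) * f_vals[idx];       /* f */

            if (i == 1) val += k * k * g[0 * stride + j];        /* v, West  */
            if (i == n) val += k * k * g[(n + 1) * stride + j];  /* v, East  */
            if (j == 1) val += h * h * g[i * stride + 0];        /* w, South */
            if (j == m) val += h * h * g[i * stride + (m + 1)];  /* w, North */

            b[idx] = val;
        }
    }
}
```

Interior points that touch no boundary receive only the forcing term, matching the zero entries of v and w.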
F. CONCLUSION



This process has shown a few examples of partial differential equations that 
appear frequently in nature. Poisson’s equation in two dimensions was selected as 
an example. After the finite difference approximation is selected, determining the 
system of equations is a tedious (but not too complicated) process. Once the system 
of equations is written down, the matrix representation is easy to come by. 
APPENDIX E



HYPERCUBE COMMUNICATIONS 

This report displays the results of point-to-point communications tests that 
were performed on the Intel iPSC/2 hypercube. The emphasis of the experiment 
was to evaluate several aspects of communications time. The exercise showed that 
communication on this machine is virtually independent of the Hamming distance 
between communicating nodes. There is clear evidence that transmission rates are 
related to message length (the transmission system favors longer messages) due — at 
least in part — to an overhead charged to begin the communication. Communications 
between the host and a node never achieve the rate that can be realized with node- 
to-node transmissions. 

The communications test code described in this appendix was only executed on 
the iPSC/2. Time did not permit modification of the code and testing on the trans- 
puter networks. A thorough test of communications and computational abilities of 
the T414 and T800 transputers has already been performed by Gregory Bryant. His 
master's thesis [Ref. 26] contains the documentation of this work. A short summary
of Bryant’s findings is included in the conclusions to this appendix. 

A. SOURCE CODE OVERVIEW 

The host program (commtst.c) and a node program (commtstn.c) contain 
most of the code for this experiment. There is also a header file, commtst.h, shared 
by these codes. Finally (but perhaps most important for any high-level survey of the
code), the makefile commtst.mak shows dependencies and compilation procedures. 



In the discussion that follows, bold-faced type is used to indicate function and object 
names that actually appear in the code. 

B. STRATEGY 

The program must define the valid arguments. The function interpret_args() 
takes care of checking for occurrences of these arguments in the command line. 
When the arguments have been interpreted, we know how to set variables like reps 
(repetitions), bytes (length of the message to be passed), and verbose (to control 
how much data is spewed out). Once these values are known, the host instructs each 
node to either RECEIVE or SEND. A special Tasking packet (structure) carries 
instructions to each node independently. Only one node is designated to SEND 
at any one time; the rest RECEIVE. Receivers simply crecv() the given number 
of bytes and return the message to the originator by calling csend(). Since this 
involves a round-trip, the issue of timing requires attention. 

We can divide the time measurement by two (to account for the round-trip),
provided we aren't deceived by the outcome. That is, passing two b-byte messages is
not the same as passing a single message of length 2b bytes. To make the timing data
credible, however, the round-trip method is essential. The precision of the mclock()
function is an additional issue. At best, mclock() is accurate to the millisecond (and
ten milliseconds may be a more reasonable expectation). Very short messages can
produce questionable results in terms of the precision of the timing data.

For this reason, tests of short messages should be repeated a number of times 
within the block surrounded by time checks. This, of course, revives the same issue 
(multiple repetitions of a message are not equivalent to a single, longer message). 
We may proceed, however, provided we establish a common understanding of the 
problem domain and terminology. I have used the term effective time to capture this 
subtlety. 
Wherever this term appears, it should be interpreted according to the following
definition:

$$t_e = \frac{t}{2p}$$

where $t_e$ is the effective time, t is the actual time measurement for the message, and p
is the number of repetitions. The factor of two is included to account for the round-trip.
For instance, suppose that the user asks for three repetitions of a message. The
implementation carries this out in a for loop. Time is sampled before and after the
loop. The inside of the loop is the simple csend() and crecv() sequence described
earlier. The effective time in this example would be $t_e = t/6$.
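The definition of effective time amounts to a one-line computation. The helper below is a sketch only; the name `effective_time_ms` is hypothetical and is not taken from commtst.c.

```c
#include <assert.h>

/* Effective one-way time t_e = t / (2p), where t_ms is the measured
 * time for the whole loop and reps is the repetition count p. */
double effective_time_ms(double t_ms, int reps)
{
    assert(reps > 0);
    return t_ms / (2.0 * reps);
}
```

For the three-repetition example in the text, a measured loop time of 6 msec yields an effective time of 1 msec.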

In summary, there is no convenient (and credible) method for timing one-way
communications. If we time one-way communications, the results could be misleading
in that we could not be certain that the clock was starting just before the
beginning of the csend() and stopped immediately after the receiving node accumulated
the final byte of the message. We must also consider the issue of blocking
communication.* Thus, the (round-trip) method is not so easily misled by the fact
that csend() is not actually blocking. The transmission duties are quickly handed
over to a communication manager and processing continues directly. The crecv()
enforces blocking communications and execution stops at this function until the last
byte has been acquired. Thus the round-trip method seems to be quite reliable,
particularly in the case of node-to-node communications (if the host is involved, the
results are less consistent).

*By definition, blocking means that the invoking process (send or receive) causes execution of
the program to stop (be blocked from the CPU) until the communications requirement has been
satisfied.

Since receiver nodes have nothing else to do but receive and retransmit the
message, the performance loss due to the round-trip method should be (almost entirely)
accounted for by two factors (loosely) placed into “software” and “hardware”
categories:

• Software overheads like establishing and freeing the activation stack for functions
(e.g., the csend() and crecv() functions).

• Hardware overheads associated with establishing the communication path and
performing switching. The take-down time for this task is probably negligible.

Hence, if this method of analyzing communications performance errs, it does so on 
the conservative side. That is, the timing used in this method is liberal (if anything), 
so that communication rates will be estimated conservatively. 

C. RESULTS 

Considering the nature of the implementation, communications will be consid- 
ered bidirectional. In particular, the term “host-to-node” communications does not 
imply that the host is the originator of directed communication, but that a bidirec- 
tional exchange takes place between some node and the host. The host does send 
directed, one-way instructions to the nodes, but all timed communication originates 
at a node and returns to that node (even if it goes to the host). There are essentially 
three groups of results, each of which captures data for node-to-node communications
and host-to-node communications.

1. Small Messages Repeated Ten Times 

The first test involved messages of length $\ell \le 1{,}024$ bytes. Since the
shortest of these would not generate trustworthy timing data, the repetition count,
p, was set at ten. This gave $t_e = t/20$. Table E.1 shows the results.
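The Rate columns in the tables follow from the message length and the effective time. The sketch below assumes that one kbyte means 1,024 bytes; that convention is inferred from the tabulated values (for instance, 1,024 bytes at an effective time of 1.02 msec is listed as 980.39 kbytes/sec) and is not stated explicitly in the text. The function name is hypothetical.

```c
#include <assert.h>

/* Transmission rate in kbytes/sec for a message of the given length,
 * using the effective time in milliseconds.
 * Assumption: 1 kbyte = 1024 bytes, inferred from the tables. */
double rate_kbytes_per_sec(long bytes, double te_ms)
{
    assert(bytes > 0 && te_ms > 0.0);
    double kbytes  = bytes / 1024.0;
    double seconds = te_ms / 1000.0;
    return kbytes / seconds;
}
```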
TABLE E.1: SHORT MESSAGES WITH TEN REPETITIONS

Message             Node-to-Node                      Host-to-Node
Length          t      t_e          Rate          t      t_e          Rate
(Bytes)    (msec)   (msec)  (kbytes/sec)     (msec)   (msec)  (kbytes/sec)
--------------------------------------------------------------------------
      1      7.10     0.36          2.75      71.40     3.57          0.27
      2      7.00     0.35          5.58      79.40     3.97          0.49
      4      7.00     0.35         11.16      78.90     3.95          0.99
      8      7.00     0.35         22.32      75.80     3.79          2.06
     16      7.20     0.36         43.40      78.10     3.91          4.00
     32      7.30     0.37         85.62      79.40     3.97          7.87
     64      7.70     0.39        162.34      87.10     4.36         14.35
    128     13.90     0.70        179.86     132.10     6.61         18.93
    192     14.30     0.72        262.24     134.60     6.73         27.86
    256     14.70     0.74        340.14     137.50     6.88         36.36
    320     15.30     0.77        408.50     139.60     6.98         44.77
    384     15.80     0.79        474.68     142.40     7.12         52.67
    448     16.20     0.81        540.12     147.40     7.37         59.36
    512     16.70     0.84        598.80     180.30     9.02         55.46
    576     17.10     0.86        657.89     201.50    10.08         55.83
    640     17.60     0.88        710.23     207.00    10.35         60.39
    704     18.10     0.91        759.67     208.80    10.44         65.85
    768     18.50     0.93        810.81     204.50    10.23         73.35
    832     19.00     0.95        855.26     180.00     9.00         90.28
    896     19.40     0.97        902.06     152.30     7.62        114.90
    960     19.90     0.99        942.21     147.80     7.39        126.86
   1024     20.40     1.02        980.39     148.90     7.45        134.32
Figure E.1: Speed of Small Host-Node Messages (Ten Repetitions)

a. Host-to-Node Performance 

The communication rates for small host-node messages with a repetition
count of ten are illustrated in Figure E.1. Communications involving the host
produce very irregular results (in the sense that the relationship between length and 
performance is not straightforward). The experiment was executed when only one 
user was logged in at the host and the results followed the same general pattern on 
repeated tests. 
Figure E.2: Speed of Small Messages Between Nodes (Ten Repetitions)



b. Node-to-Node Performance 

In the absence of contention for the communication medium, node- 
to-node communications within the cube are quite predictable. Figure E.2 shows 
transmission rates for small messages (up to one kilobyte) repeated ten times. 
TABLE E.2: SHORT MESSAGES WITH ONE HUNDRED REPETITIONS

Message             Node-to-Node                      Host-to-Node
Length          t      t_e          Rate          t      t_e          Rate
(Bytes)    (msec)   (msec)  (kbytes/sec)     (msec)   (msec)  (kbytes/sec)
--------------------------------------------------------------------------
      1     68.60     0.34          2.85     837.40     4.19          0.23
      2     68.60     0.34          5.69     818.30     4.09          0.48
      4     68.70     0.34         11.37     795.00     3.98          0.98
      8     69.40     0.35         22.51     774.50     3.87          2.02
     16     70.30     0.35         44.45     758.30     3.79          4.12
     32     71.70     0.36         87.17     737.10     3.69          8.48
     64     75.30     0.38        166.00     721.30     3.61         17.33
    128    137.60     0.69        181.69    1020.10     5.10         24.51
    192    142.30     0.71        263.53    1007.10     5.04         37.24
    256    146.80     0.73        340.60    1007.00     5.04         49.65
    320    152.00     0.76        411.18    1004.50     5.02         62.22
    384    156.20     0.78        480.15    1013.40     5.07         74.01
    448    161.00     0.81        543.48    1043.80     5.22         83.83
    512    165.30     0.83        604.96    1152.90     5.76         86.74
    576    169.80     0.85        662.54    1335.40     6.68         84.24
    640    174.50     0.87        716.33    1419.50     7.10         88.06
    704    179.30     0.90        766.87    1688.50     8.44         81.43
    768    183.20     0.92        818.78    1869.90     9.35         80.22
    832    188.20     0.94        863.44    1520.00     7.60        106.91
    896    192.90     0.96        907.21    1070.30     5.35        163.51
    960    197.70     0.99        948.41    1061.60     5.31        176.62
   1024    202.40     1.01        988.14    1048.80     5.24        190.69



2. Small Messages Repeated One Hundred Times 



For the next experiment, data was collected from runs using the same message
lengths, but the repetition count, p, was raised to one hundred. This gives
$t_e = t/200$, as shown in Table E.2.



a. Host-to-Node Performance 

Figure E.3 gives the transmission rates corresponding to this data. 
Figure E.3: Speed of Small Host-Node Messages (One Hundred Repetitions)

Figure E.4: Speed of Small Messages Between Nodes (One Hundred Repetitions)

b. Node-to-Node Performance 

Figure E.4 shows the transmission rates for the node-to-node messages. 
This data may have important implications. Consider the transmission of a matrix 
row-by-row within a loop (where one row is transmitted each time through the 
loop). The expected communications performance is related to the number of bytes 
in a single row of the matrix, not the size of the entire matrix. 
3. Larger Messages

The final test considered longer messages ($1{,}024 \le \ell \le 262{,}144$) that were
not repeated. This gives $t_e = t/2$. Since the experiment was performed over a rather
large set of message lengths, the data is divided at an arbitrary point. Messages
of 64K bytes and less are designated “medium” length messages and placed into
Table E.3. Messages of length 128K bytes and greater are designated “long” messages
and placed into Table E.4. There is no hidden significance to this separation; it just
made for tables of reasonable length.

The figures that follow are based upon the combined data of both of these
tables. The host terminates execution at the crecv() if we ask for more than 262,144
bytes in a single message. Chapter 2 (iPSC/2 C Library Calls) of [Ref. 45: pp. 2-16,
2-19] explains: “messages to or from a host process are limited to a maximum
of 256K bytes. There is no limit on message length between nodes.” This explains
why the data stops at that message size.
TABLE E.3: MESSAGES OF MEDIUM LENGTH

Message             Node-to-Node                      Host-to-Node
Length          t      t_e          Rate          t      t_e          Rate
(Bytes)    (msec)   (msec)  (kbytes/sec)     (msec)   (msec)  (kbytes/sec)
--------------------------------------------------------------------------
   1024      2.20     1.10        909.09       9.00     4.50        222.22
   2048      2.80     1.40       1428.57      10.40     5.20        384.62
   3072      3.70     1.85       1621.62      11.90     5.95        504.20
   4096      4.40     2.20       1818.18      13.40     6.70        597.01
   5120      5.10     2.55       1960.78      14.50     7.25        689.66
   6144      5.80     2.90       2068.97      14.50     7.25        827.59
   7168      6.50     3.25       2153.85      15.50     7.75        903.23
   8192      7.40     3.70       2162.16      16.50     8.25        969.70
   9216      8.10     4.05       2222.22      19.50     9.75        923.08
  10240      8.80     4.40       2272.73      18.00     9.00       1111.11
  11264      9.50     4.75       2315.79      18.90     9.45       1164.02
  12288     10.30     5.15       2330.10      19.00     9.50       1263.16
  13312     10.90     5.45       2385.32      19.60     9.80       1326.53
  14336     11.80     5.90       2372.88      20.30    10.15       1379.31
  15360     12.50     6.25       2400.00      21.90    10.95       1369.86
  16384     13.20     6.60       2424.24      22.40    11.20       1428.57
  17408     13.90     6.95       2446.04      23.30    11.65       1459.23
  18432     14.60     7.30       2465.75      24.90    12.45       1445.78
  19456     15.40     7.70       2467.53      24.30    12.15       1563.79
  20480     16.10     8.05       2484.47      27.30    13.65       1465.20
  21504     16.80     8.40       2500.00      27.10    13.55       1549.82
  22528     17.60     8.80       2500.00      27.00    13.50       1629.63
  23552     18.40     9.20       2500.00      27.80    13.90       1654.68
  24576     19.10     9.55       2513.09      29.30    14.65       1638.23
  25600     19.80     9.90       2525.25      29.40    14.70       1700.68
  26624     20.50    10.25       2536.59      30.60    15.30       1699.35
  27648     21.30    10.65       2535.21      30.90    15.45       1747.57
  28672     22.10    11.05       2533.94      33.50    16.75       1671.64
  29696     22.70    11.35       2555.07      38.50    19.25       1506.49
  30720     23.50    11.75       2553.19      37.90    18.95       1583.11
  31744     24.20    12.10       2561.98      37.90    18.95       1635.88
  32768     24.90    12.45       2570.28      38.10    19.05       1679.79
  65536     48.50    24.25       2639.18      59.90    29.95       2136.89





TABLE E.4: LONG MESSAGES

Message             Node-to-Node                      Host-to-Node
Length          t      t_e          Rate          t      t_e          Rate
(Bytes)    (msec)   (msec)  (kbytes/sec)     (msec)   (msec)  (kbytes/sec)
--------------------------------------------------------------------------
 131072     95.60    47.80       2677.82     109.40    54.70       2340.04
 150528    109.60    54.80       2682.48     123.60    61.80       2378.64
 161792    117.70    58.85       2684.79     131.60    65.80       2401.22
 162816    118.40    59.20       2685.81     132.90    66.45       2392.78
 163840    119.10    59.55       2686.82     133.60    66.80       2395.21
 164864    119.90    59.95       2685.57     135.00    67.50       2385.19
 165888    120.60    60.30       2686.57     136.30    68.15       2377.11
 172032    125.00    62.50       2688.00     140.80    70.40       2386.36
 182272    132.40    66.20       2688.82     148.10    74.05       2403.78
 192512    139.70    69.85       2691.48     155.60    77.80       2416.45
 202752    147.10    73.55       2692.05     164.60    82.30       2405.83
 223232    161.80    80.90       2694.68     181.10    90.55       2407.51
 243712    176.50    88.25       2696.88     194.80    97.40       2443.53
 253952    183.80    91.90       2698.59     202.80   101.40       2445.76
 259072    187.60    93.80       2697.23     205.50   102.75       2462.29
 262144    189.70    94.85       2699.00     210.50   105.25       2432.30
Figure E.5: Speed of Large Host-Node Messages



a. Host-to-Node Performance 

The host-to-node communication rates (for large messages) are illus- 
trated in Figure E.5. 
Figure E.6: Speed of Large Messages Between Nodes



b. Node-to-Node Performance 

Figure E.6 shows the transmission rates for the same long messages 
when passed among nodes of the hypercube. To move the plot of Figure E.6 out 
into the open, a plot of transmission rate versus $\log_{10} \ell$ is shown in Figure E.7.

Figure E.7: Node-to-Node Transmission Rates for Large Messages

D. CONCLUSIONS



One of the obstacles that this experiment carefully avoided was competition 
for the links. Contention for communications resources may be inherent in certain 
parallel programs. Potential causes and effects of contention should always be given 
due consideration in the crafting of a parallel application. All of the algorithms that 
were tested in this research work involved very structured, regular communications 
schemes. An application with very random communication patterns should be ex- 
pected to behave very differently. Additionally, the communication scheme for every 
program in this work was designed to use the shortest possible path. 

The circuit switching approach has the disadvantage that a single message must 
control the entire path from origin to destination. Under a less controlled, random
pattern of communications, the communications subsystem might reasonably be
expected to perform worse. Other portions of this thesis
show that a communication-bound algorithm can experience severe performance
degradation as well. There is no specific claim that the results obtained in this 
experiment represent an upper bound for node-to-node communications within the 
hypercube, but they are probably good estimates for an upper bound. 

Host-node communication is slower than node-to-node communication. This
is not surprising (consider the physical distances and materials). In the absence of
competition for the links, node-to-node transmission rates are essentially predictable
for a given message length. There is a tremendous rise in transmission rate as message
length goes from one byte to the vicinity of twenty kilobytes. Thereafter, smaller
(apparently asymptotic) performance gains are achieved by increasing the message
size. A similar phenomenon occurs with host-node communications, but it takes
much longer messages to break, say, the two-megabytes-per-second transmission
rate.
These performance measures are quite appealing for long messages, but consider
transmissions of shorter (and possibly repetitious) messages. The data shows
that short messages are penalized, even if they are part of a loop that involves a 
good deal of communication. Each instance of csend() or crecv() is distinct and 
incurs its own start-up cost. This is an important note for anyone considering 
transmission of the rows (or columns) of a matrix within a loop structure. The 
potential of (pre-transmission) storage of matrices (two-dimensional arrays) into 
one-dimensional arrays might be investigated as a means of increasing the commu- 
nications rate (provided the cost of copying the array is not prohibitive). 
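The idea of packing a two-dimensional array into a one-dimensional buffer before transmission can be sketched as follows. This is an illustration under assumptions: the matrix is taken to be stored as an array of row pointers, the caller is assumed to supply a buffer of sufficient size, and the name `pack_rows` is hypothetical. The packed buffer could then be passed to a single send call instead of one call per row.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy nrows rows of ncols doubles each into one contiguous buffer.
 * Returns the number of bytes now ready to be sent as one message. */
size_t pack_rows(double *const *rows, size_t nrows, size_t ncols,
                 double *buf)
{
    assert(rows != NULL && buf != NULL);

    for (size_t r = 0; r < nrows; r++) {
        /* Row r lands at offset r * ncols in the flat buffer. */
        memcpy(buf + r * ncols, rows[r], ncols * sizeof(double));
    }
    return nrows * ncols * sizeof(double);
}
```

Whether this pays off depends on the copy cost relative to the per-message start-up cost, as the text cautions.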

Communications in a transputer network was not developed in this work, but 
Bryant [Ref. 26] gives a very thorough analysis of communications and calculations 
in a network of transputers. On pages 31-34, Bryant gives a good summary of 
unidirectional and bidirectional data transfer rates. He discusses link interaction (i.e., 
how communications performance varies as one, two, or all four of the transputer’s 
links are engaged in communication) on pages 34-38 and concludes that the effects 
of link interaction are minimal. 

Bryant also discusses the effects of varied communication loads on processor 
performance. On pages 38-44, he finds that bombarding a transputer with many 
small messages while it is trying to perform calculations can severely degrade the 
processor’s performance. His Figures 3.8 and 3.9 show that — with only one link 
active — messages of size 100 bytes and larger cause negligible performance degrada- 
tion. With all four links active, messages of size greater than one kilobyte should be 
used to free the processor from most of the communications overhead. 

Pages 36 and 37 of Bryant’s thesis show the effects of message length on the
communication rate. Bryant’s Figures 3.4 and 3.5 are quite similar to Figure E.6
above, but the transputers are much more responsive (i.e., there seems to be less
overhead involved, so the peak communications rate is achieved much earlier). In
fact, the transputers are near their peak transmission rate with messages of 100 bytes,
and messages of one kilobyte and greater always travel at peak rates.

Comparing a transputer system to an iPSC/2 system — in terms of communi- 
cations performance — is essentially a lesson in the differences between store-and- 
forward switching versus circuit switching for multi-hop communications. Bryant 
shows [Ref. 26 : pp. 83-85] that the store-and-forward transmission rates suffer as 
the number of hops grows. The direct-connect (circuit switching) approach recovers 
its overhead on multi-hop communications, but it ties up the entire path to do so 
(making it unavailable to other potential users). The key difference is that commu- 
nications performance with the direct-connect method is very nearly independent of 
the number of hops. 

The transputer system seems to enforce true blocking communications on both 
the sending and receiving ends (byte-by-byte acknowledgment is part of the protocol).
The iPSC/2 csend() is not blocking, but the crecv() function is blocking.
Proper handling of these issues can become important when implementing an algo- 
rithm. Each method has advantages and disadvantages, but — at least for the current 
systems — transputers seem better suited for applications involving short messages 
over short distance and the iPSC/2 seems to handle long messages over long distances 
better. 

E. SOURCE CODE LISTINGS 

The source code listings for the programs used for these tests are supplied on 
the pages that follow. The makefile commtst.mak appears first and describes the 
dependencies among the files and compilation procedures. Next, commtst.h is the 
header file associated with these programs. Finally, the actual code is given in a host 
program called commtst.c and the node program commtstn.c. 
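
The rate that these programs report follows from the round-trip design: each repetition moves the message out and back, so the byte count is doubled and multiplied by the number of repetitions before it is divided by the elapsed time in milliseconds. A minimal sketch of that arithmetic follows; the function name round_trip_rate and the negative return value for a zero timer reading are my own assumptions, not part of the listings.

```c
#include <assert.h>

/* Rate in bytes per millisecond (numerically, thousands of bytes per
 * second) for `reps` round trips of a message of `message_bytes` bytes
 * completed in `elapsed_ms` milliseconds.  Returns a negative value when
 * the timer reading is zero, i.e., the run was too short to time. */
double round_trip_rate(long message_bytes, long reps, unsigned long elapsed_ms)
{
    double total_bytes;

    assert(message_bytes > 0 && reps > 0);

    if (elapsed_ms == 0) return -1.0;

    /* two-way communication: out and back for every repetition */
    total_bytes = 2.0 * (double) reps * (double) message_bytes;

    return total_bytes / (double) elapsed_ms;
}
```

For example, ten round trips of a 1024-byte message that take 20 milliseconds in total correspond to 20480 bytes in 20 ms, or 1024 bytes per millisecond. The zero-reading case is the reason several repetitions are needed for short messages when the clock resolves only milliseconds.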
commtst.mak

# Author:  Jonathan E. Hartman, U. S. Naval Postgraduate School
# Purpose: Makefile for Hypercube Communications Test Programs
# Date:    07 August 1991

all: hostcode nodecode

help:
	chelp


#
hostcode: commtst.o clargs.o
	cc clargs.o commtst.o -host -o commtst

clargs.o: clargs.h clargs.c
commtst.o: commtst.h commtst.c


#
nodecode: commtstn.o
	cc commtstn.o -node -o commtstn

commtstn.o: commtstn.c commtst.h


# Execute it!
run: all
	commtst -d 3 -b 1024 -r 2


# Delete object files, executables
clean:
	rm *.o
	rm commtst
	rm commtstn

# EOF commtst.mak
commtst.h

/* ========== PROGRAM INFORMATION ==========
 *
 *  SOURCE  : commtst.h
 *  VERSION : 1.2
 *  DATE    : 07 August 1991
 *  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION ==============
 *
 *  This header file gives common information for use across the host program
 *  commtst.c and the node program commtstn.c.  A more complete description
 *  can be found in commtst.c.
 *
 */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE  -1
#endif

#define MAX_CUBESIZE  16

#define ROOT          -1

#define RECEIVE       0
#define SEND          1

#define FALSE         0
#define TRUE          1


/* ============= TYPE DEFINITION =============
 *
 *  The following structure is the framework that the root processor (host)
 *  uses to pass instructions to the worker nodes in the cube.
 */

typedef struct {

    int  task;                       /* choose RECEIVE or SEND as above    */
    long bytes;                      /* length of message                  */
    long reps;                       /* number of repetitions              */
    int  destination[MAX_CUBESIZE];  /* for senders: identifies addressees */

} Tasking;


/* ============ EOF commtst.h ============ */
commtst.c

/* ========== PROGRAM INFORMATION ==========
 *
 *  SOURCE     : commtst.c
 *  VERSION    : 1.2
 *  DATE       : 07 August 1991
 *  AUTHOR     : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *  USAGE      : commtst [-d dimension] [-b bytes] [-r repetitions] [-v]
 *
 *  EXAMPLE    : If you type 'commtst -d 3 -v -b 1024 -r 10', it means to
 *               run the program on a dimension 3 hypercube in the verbose
 *               mode, with messages of length 1024 bytes, and 10 repetitions
 *               for each message.
 *
 *  REFERENCES : [1] iPSC/2 Programmer's Reference Manual
 *
 *
 * ============== DESCRIPTION ==============
 *
 *  This program runs on the host.  It orchestrates various point-to-point
 *  communication tasks between nodes of a hypercube.  The time of round-trip
 *  communications is gathered and printed out.  The output includes the time
 *  required and rate of communication (taking into account repetitions and
 *  round-trips).  The 'verbose' mode gives a more detailed node-by-node
 *  accounting of the run.
 *
 */

char *version = "Hypercube Communications Test, Version 1.2";

/* ============== ALGORITHM ==============
 *
 *  The root (host) processor determines who will communicate with whom, and
 *  when.  No node operates independently.  The host identifies a sender and
 *  receiver(s).  The host also gives the length of the message that should
 *  be passed and the number of times that the message is to be repeated
 *  (multiple repetitions may be required when the message is short since
 *  mclock() returns milliseconds).  The 'Tasking' structure holds instructions
 *  from the manager (i.e., SEND or RECEIVE, the length of the message, number
 *  of repetitions, and addressees).  When this structure is received at
 *  a node, it performs the task and awaits further instructions from the
 *  manager processor.  If the processor is a sender, it returns timing data
 *  to the host upon completion.
 *
 */

#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"
#include "macros.h"
#include "clargs.h"

#define ASCII_CONVERSION  48   /* for char -> int conversion of 0...3 */
#define CT_SIZE           4    /* for cubetype[] size                 */
#define NUM_ARGS          4    /* -d -b -r -v                         */
#define DIM               0    /* index values into optv[]            */
#define BYTES             1
#define REPS              2
#define VERBOSE           3


/* ============== FUNCTION DEFINITION ============== */

#ifdef PROTOTYPE

void init(int argc, char **argv, char cubetype[CT_SIZE],
          int *dim, long *bytes, long *reps, int *verbose)

#else

void init(argc, argv, cubetype, dim, bytes, reps, verbose)

    int    argc;
    char **argv,
           cubetype[CT_SIZE];
    int   *dim;
    long  *bytes,
          *reps;
    int   *verbose;

#endif
{
    int count = 1,
        valid = FALSE;

    Opt_Struct *optv[NUM_ARGS];

    /*
     *  The first step is to make a table of all of the valid arguments.  The
     *  structure is defined more carefully in clargs.h, but the basic idea is
     *  that we have an array of pointers to type Opt_Struct (option structure)
     *  ...in this case, there are NUM_ARGS valid arguments and the next few
     *  steps take care of allocation and definition of them.  When this is
     *  done, it is time to call interpret_args() to see what the user entered.
     */
    optv[DIM]     = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[BYTES]   = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[REPS]    = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[VERBOSE] = (Opt_Struct *) calloc(1, sizeof(Opt_Struct));
    optv[DIM]->lanswer   = (long *) calloc(1, sizeof(long));
    optv[BYTES]->lanswer = (long *) calloc(1, sizeof(long));
    optv[REPS]->lanswer  = (long *) calloc(1, sizeof(long));

    /* The intel compiler didn't like ...->argname = "-d"; etc. */
    optv[DIM]->argname[0] = '-';
    optv[DIM]->argname[1] = 'd';
    optv[DIM]->subargc    = 1;
    optv[DIM]->subargi    = NEXT_LONG;

    optv[BYTES]->argname[0] = '-';
    optv[BYTES]->argname[1] = 'b';
    optv[BYTES]->subargc    = 1;
    optv[BYTES]->subargi    = NEXT_LONG;

    optv[REPS]->argname[0] = '-';
    optv[REPS]->argname[1] = 'r';
    optv[REPS]->subargc    = 1;
    optv[REPS]->subargi    = NEXT_LONG;

    optv[VERBOSE]->argname[0] = '-';
    optv[VERBOSE]->argname[1] = 'v';
    optv[VERBOSE]->subargc    = 0;

    *dim = -1;

    interpret_args(argc, argv, NUM_ARGS, optv);

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lanswer[0];

    switch (*dim) {

        case 0 : case 1 : case 2 : case 3 : break;

        default :

            while (!valid) {

                printf("Enter desired cube dimension (in {0, 1, 2, 3}) : ");
                scanf("%d", dim);
                fflush(stdin);
                switch (*dim) {

                    case 0 : case 1 : case 2 : case 3 : valid = TRUE; break;
                }
            }

    } /* end switch() */
    if (optv[BYTES]->found) *bytes = optv[BYTES]->lanswer[0];

    valid = FALSE;

    if (*bytes < 1) {

        while (!valid) {

            printf("Enter message length (bytes): ");
            scanf("%ld", bytes);
            fflush(stdin);

            if (*bytes > 0) { valid = TRUE; }
            else { printf("Message length must be positive.\n"); }
        }
    }

    if (optv[REPS]->found) { *reps = optv[REPS]->lanswer[0]; }
    else {

        printf("Non-existing (or invalid) repetition count, ");
        printf("using one repetition.\n\n");

        *reps = 1;
    }

    *verbose = (optv[VERBOSE]->found) ? TRUE : FALSE;

    cubetype[0] = 'd';    /* for dimension (to follow) */
    cubetype[1] = (char) (*dim + ASCII_CONVERSION);
    cubetype[2] = 'f';    /* means nodes are 386/387 combo */
    cubetype[3] = 0;

    printf("Initialization complete ... Cube Dimension: %d\n", *dim);
    printf(" Message Length: %ld\n", *bytes);
    printf(" Repetitions: %ld\n\n", *reps);
    if (*verbose) printf(" Verbose Mode: ON");

}
/* End init() */


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

    int   argc;
    char *argv[];

#endif
{ /* begin main() */

    char *cubename = "Hypercube",
          cubetype[CT_SIZE],
         *msg,
         *nodecode = "commtstn";

    float avg,
          avg_hostrate,
          avg_hosttime,
          avg_rate,
          avg_time,
          bytes,
          reps;

    int cubesize,
        dim,
        i,
        j,
        verbose;

    unsigned long **timing_data;

    Tasking task_packet;


    printf("\n%s\n\n", version);

    init(argc, argv, cubetype, &dim, &(task_packet.bytes),
         &(task_packet.reps), &verbose);

    bytes = (float) task_packet.bytes;
    reps  = (float) task_packet.reps;
    bytes *= (2.0 * reps);    /* account for two-way communications, reps */

    cubesize = POW2(dim);

    timing_data = (unsigned long **) calloc(cubesize, sizeof(unsigned long*));

    for (i = 0; i < cubesize; i++) {

        timing_data[i] = (unsigned long*) calloc(cubesize, sizeof(unsigned long));
    }

    if ( !(msg = (char *) calloc(task_packet.bytes, sizeof(char))) ) {

        printf("main() : Allocation failure for msg.\n");
        exit(EXIT_FAILURE);
    }

    /* Get the cube and load the node code */

    getcube(cubename, cubetype, NULL, 0);
    attachcube(cubename);
    setpid(0);

    load(nodecode, ALL_NODES, NODE_PID);


    /* Perform the tasking, receive the message, return it, receive and print
     * timing data...repeat for all players.  The outer loop index, i, will
     * represent the sender node.  The j index runs the other (RECEIVE)
     * players.
     */

    for (i = 0; i < cubesize; i++) {

        /* Get the receivers ready first */

        task_packet.task = RECEIVE;
        task_packet.destination[0] = i;
        task_packet.destination[1] = cubesize;    /* impossible flags end */
        for (j = 0; j < i; j++) {

            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }
        for (j = (i+1); j < cubesize; j++) {

            csend(0, &task_packet, sizeof(Tasking), j, NODE_PID);
        }

        /* Then prepare the sender ==> he can start */
        task_packet.task = SEND;
        for (j = 0; j < i; j++) task_packet.destination[j] = j;
        task_packet.destination[i] = ROOT;
        for (j = (i+1); j < cubesize; j++) task_packet.destination[j] = j;
        csend(0, &task_packet, sizeof(Tasking), i, NODE_PID);

        /* Receive from the sender and return his message */
        for (j = 0; j < task_packet.reps; j++) {

            crecv(ANY_TYPE, msg, task_packet.bytes);
            csend(0, msg, task_packet.bytes, i, NODE_PID);
        }

        /* Receive the timing data from this run and print it */
        crecv(ANY_TYPE, timing_data[i], (cubesize * sizeof(unsigned long)));

    } /* end for (i) */
    for (i = 0; i < cubesize; i++) {
        if (verbose) {

            printf("Source Dest. Time (msec) Rate (kilobytes/second)\n");
            printf("====== ===== =========== =======================\n");
            printf("%4d HOST %10lu ", i, timing_data[i][i]);
            printf(" %10.2f\n", (bytes / ((float) timing_data[i][i])));
        }

        avg = 0.0;

        for (j = 0; j < cubesize; j++) {

            if (i != j) {

                avg += (float) timing_data[i][j];
                if (verbose) {

                    printf(" %4d ", j);
                    printf(" %10lu ", timing_data[i][j]);
                    printf("%10.2f\n", (bytes / ((float) timing_data[i][j])));
                }
            }

            if (j == (cubesize - 1)) {

                avg /= (float) cubesize - 1;
                if (verbose) {

                    printf("============================================");
                    printf("==========\n");
                    printf("Averages %9.1f msec ", avg);
                    printf(" %7.2f", bytes/avg);
                    printf(" kbytes/sec\n\n\n");
                }
            }

        } /* end for(j) */

    } /* end for(i) */


    /* Accumulate node <--> host times (i == j) and node <--> node times */
    avg_hosttime = 0.0;
    avg_time     = 0.0;

    for (i = 0; i < cubesize; i++) {

        for (j = 0; j < cubesize; j++) {

            if (i == j) avg_hosttime += (float) timing_data[i][j];
            else        avg_time     += (float) timing_data[i][j];
        }
    }

    avg_hosttime /= cubesize;
    avg_hostrate = bytes/avg_hosttime;

    avg_time /= ((cubesize - 1) * cubesize);
    avg_rate = bytes/avg_time;

    printf("If we average all of the times and rates....\n\n");
    printf(" Average Time: %9.1f milliseconds\n", avg_time);
    printf(" Average Rate: %10.2f kilobytes/second\n\n\n", avg_rate);

    printf("NOTE: Average and Rate values are for the nodes ONLY.\n");
    printf(" They do not include the host timing data.\n\n\n");

    printf("The averages for the node <--> host communications were:\n\n");
    printf(" Average Time: %9.1f milliseconds\n", avg_hosttime);
    printf(" Average Rate: %10.2f kilobytes/second\n\n\n", avg_hostrate);

}
/* ============ EOF commtst.c ============ */
commtstn.c

/* ========== PROGRAM INFORMATION ==========
 *
 *  SOURCE  : commtstn.c
 *  VERSION : 1.2
 *  DATE    : 07 August 1991
 *  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION ==============
 *
 *  This program is loaded by commtst.c (which runs on the host).  This code
 *  (commtstn.c) runs on the nodes of a hypercube created by the host program.
 *  For more information, see commtst.c.
 *
 */


#include <stdio.h>
#include "commtst.h"
#include "ipsc.h"

#define SUCCESS 0


#ifdef PROTOTYPE

main(int argc, char *argv[])

#else

main(argc, argv)

    int   argc;
    char *argv[];
#endif
{
    char *msg;

    int cubesize = numnodes(),
        i,
        j,
        return_addr;

    long rep;

    unsigned long start, *timing_data;

    Tasking task_packet;

    timing_data = (unsigned long*) calloc(cubesize, sizeof(unsigned long));
    for (i = 0; i < cubesize; i++) {

        crecv(ANY_TYPE, &task_packet, sizeof(Tasking));

        msg = (char *) calloc(task_packet.bytes, sizeof(char));

        switch (task_packet.task) {
            case RECEIVE :

                return_addr = task_packet.destination[0];

                for (rep = 0; rep < task_packet.reps; rep++) {

                    crecv(ANY_TYPE, msg, task_packet.bytes);
                    csend(0, msg, task_packet.bytes, return_addr, NODE_PID);
                }
                break;

            case SEND :

                j = 0;

                while ((j < cubesize) && (task_packet.destination[j] < cubesize)) {
                    start = mclock();

                    for (rep = 0; rep < task_packet.reps; rep++) {

                        (j == mynode()) ?
                            csend(0, msg, task_packet.bytes, myhost(), NODE_PID) :
                            csend(0, msg, task_packet.bytes, j, NODE_PID);

                        crecv(ANY_TYPE, msg, task_packet.bytes);
                    }
                    timing_data[j] = mclock() - start;
                    j++;
                }

                /* Return the timing data */
                csend(0, timing_data, (cubesize * sizeof(unsigned long)),
                      myhost(), NODE_PID);

                break;

            default :

                printf("Unrecognized task at node %ld.\n", mynode());
                exit(EXIT_FAILURE);

        } /* end switch() */


        free(msg);


    } /* end for() */

    return(SUCCESS);

}
/* =========== EOF commtstn.c =========== */
APPENDIX F



MATRIX LIBRARY

This appendix contains part of the matrix library, matlib, that is often used
and referenced in other sections and code. It could be argued that “matrix library”
is a misnomer since much of the code has little to do with matrices. This criticism is
true, but I will defend the name since the entire reason for creating such a library
was to handle matrices in a more reasonable way. The last section of this appendix
contains all of the source code for Gauss factorization with partial pivoting, and a
short excerpt from the complete pivoting code.

The specifications and a portion of the source code for the library are given on
the pages to follow. The original intent was to include the source code in its entirety,
but this would require more than double the current number of pages, so the source
has been omitted. The files are divided into four logical groups:

1. Makefiles that simplify maintenance of the library, show dependencies among 
the files, and describe the compilation procedures that are used to generate the 
loadable (executable) code. 

2. Standard files (mostly C header files) that make definitions available (for
consistency) across a wide range of files. The range is implied by the content of
the file. These files include manifest constants that are installed using the C
Preprocessor #define directive, type definitions that are intended for use across
several files, and macro definitions that are expanded by the C Preprocessor.

3. Source code files that appear in pairs, like filename.h and filename.c, or (mostly)
as a header file alone. The header file gives remarks, definitions of manifest
constants, type definitions, and function declarations (specifications) that pertain to
the associated source code (i.e., the code within filename.c). Again, the latter
has been omitted in most cases.

4. The Gauss factorization code. All of the source code for the partial pivoting
version is given, and an excerpt of the pivot election function from the complete
pivoting code is also provided.
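
The header and source pairing described in item 3 can be sketched in a single file as follows. This is an illustration only: the names Example_Vec, EXAMPLE_MAX_LEN, and example_sum are hypothetical and do not come from matlib. The PROTOTYPE switch mirrors the convention used in the listings of Appendix E, and it is defined here by hand on the assumption that it stands in for the preprocessor option the makefiles pass.

```c
#include <assert.h>
#include <stddef.h>

#define PROTOTYPE 1    /* assumed stand-in for the preprocessor option */

/* ---- what would live in a hypothetical example.h ---- */

#define EXAMPLE_MAX_LEN 16              /* manifest constant */

typedef struct {                        /* type definition shared by files */
    size_t len;
    double data[EXAMPLE_MAX_LEN];
} Example_Vec;

#ifdef PROTOTYPE
double example_sum(Example_Vec *v);     /* ANSI function declaration */
#else
double example_sum();                   /* pre-ANSI declaration */
#endif

/* ---- what would live in the matching example.c ---- */

#ifdef PROTOTYPE
double example_sum(Example_Vec *v)
#else
double example_sum(v)
    Example_Vec *v;
#endif
{
    double total = 0.0;
    size_t i;

    assert(v != NULL && v->len <= EXAMPLE_MAX_LEN);

    /* add up the first len entries of the vector */
    for (i = 0; i < v->len; i++) total += v->data[i];

    return total;
}
```

Keeping the constant, the type, and the declaration in the header lets every file that includes it agree on them, which is the consistency argument made in item 2.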

A. MAKEFILES 

logc.mak This makefile is a standard template for programs compiled with the
Logical Systems C (version 89.1) product.

matlib.mak This makefile is used to translate matlib into a useable form. With
Logical Systems C, it creates a library suitable for installation and use as any
other normal C library. The portion of the makefile used on the Intel iPSC/2
simply works in the current directory to translate the source into object code so
that other programs can reference it.






logc.mak

#
#
#  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
#  PURPOSE : Makefile for Hypercube Communications Test Programs (LogC)
#  DATE    : 10 August 1991
#
#

ROOTCODE=filename
NODECODE=filename
NIF_FILE=filename


# OPTIONS AND DEFINITIONS
#
#  The following section establishes various options and definitions.  We
#  start with PP, the Logical Systems C Preprocessor.  The '-dX' option
#  (with no macro_expression) is like '#define X 1'.  Next we set up the
#  compilation options for Logical Systems' TCX Transputer C Compiler.  The
#  '-c' means compress the output file.  The options beginning with '-p'
#  tell TCX to generate code for the appropriate processor:
#
#      -p2     T212 or T222
#      -p25    T225
#      -p4     T414
#      -p45    T400 or T425
#      -p8     T800
#      -p85    T801 or T805
#
#  Logical Systems' TASM Transputer Assembler is next.  The '-c' means
#  compress the output file (it can cut it in half)!  The '-t' is used
#  because the input to TASM will be from a language translator (TCX's
#  output) and not from assembly source code.
#
#  The final list tells TLNK which libraries to look at during linking.
#  It also establishes an entry point.  You should always use _main for
#  the root node; otherwise use _ns_main (for other nodes).

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main


# DEFAULT ===> MAKE ALL
#

all: $(ROOTCODE).tld $(NODECODE).tld


# ROOT CODE
#

$(ROOTCODE): $(ROOTCODE).tld

$(ROOTCODE).tld: $(ROOTCODE).trl
	echo FLAG    c               >  $(ROOTCODE).lnk
	echo LIST    $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT   $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY   $(RENTRY)       >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB)        >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl: $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)
	tcx $(ROOTCODE).pp $(TCXOPT4)
	tasm $(ROOTCODE).tal $(TASMOPT)


# NODE CODE
#

$(NODECODE): $(NODECODE).tld

$(NODECODE).tld: $(NODECODE).trl
	echo FLAG    c               >  $(NODECODE).lnk
	echo LIST    $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT   $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY   $(NENTRY)       >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB)        >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl: $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)
	tcx $(NODECODE).pp $(TCXOPT8)
	tasm $(NODECODE).tal $(TASMOPT)

# EXECUTION
#

run: $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	ld-net $(NIF_FILE)


# CLEAN UP
#

clean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl


# EOF logc.mak







matlib.mak

# ========== MAKEFILE FOR MATRIX LIBRARY ==========
#
#  SOURCE  : matlib.mak
#  DATE    : 17 August 1991
#  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
#
#  PURPOSE : Make the matrix library 'matlib'.
#
#  REMARKS : This makefile works with Logical Systems C, version 89.1,
#            and the Intel iPSC/2 compiler.  The LogC portions of this
#  makefile actually construct libraries of the functions available in the
#  source files indicated.  There are two libraries generated -- matlib4.tll
#  & matlib8.tll since the code is compiled for T414 or T800 processors.
#  For the Intel compiler, I have not created a library; but have used the
#  object code as needed.  There are a few sections that pertain to both
#  compilers.  The sections that only pertain to a particular compiler are
#  clearly marked 'Intel iPSC/2' or 'Logical Systems C'.
#


# ========== 1.) DEFINITIONS AND OPTIONS ==========
#
#  The following options and definitions are required.  A more thorough
#  explanation can be found in 'logc.mak' or in the Logical Systems C
#  Transputer Toolset manual.
#

THISMAKEFILE=matlib.mak


# =============== 1.1) Intel iPSC/2 ===============
#

# MATLIBDIR is the directory that contains the matlib files
MATLIBDIR = /usr/hartman/matlib
OBJECTS = clargs.o comm.o hcube.o generate.o mat_ops.o matrixio.o memory.o math.o sep.o timing.o vec_ops.o


# ============ 1.2) Logical Systems C ============
#

T414LIBNAME=matlib4
T800LIBNAME=matlib8

TRL4FILES=clargs.trl4 comm.trl4 complex.trl4 generate.trl4 machine.trl4 mat_ops.trl4 math.trl4 matrixio.trl4 memory.trl4 num_sys.trl4 sep.trl4 timing.trl4 vec_ops.trl4
TRL8FILES=clargs.trl8 comm.trl8 complex.trl8 generate.trl8 machine.trl8 mat_ops.trl8 math.trl8 matrixio.trl8 memory.trl8 num_sys.trl8 sep.trl8 timing.trl8 vec_ops.trl8

TLIB4FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops
TLIB8FILES=clargs comm complex generate machine mat_ops math matrixio memory num_sys sep timing vec_ops

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800

TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8

TASMOPT=-ct

T2LIB=t2lib.tll
T4LIB=matlib4.tll t4cube.tll t4lib.tll
T8LIB=matlib8.tll t8cube.tll t8lib.tll

RENTRY=_main
NENTRY=_ns_main


# ======= 2.) INSTRUCTIONS FOR DEFAULT MAKE =======
#
#  The following sections give the default (since they appear first in the
#  makefile) options for this makefile.  By commenting one or the other
#  out, one can get to the defaults easily.
#

ipsc: imatlib
clean: iclean
# tptr: tmatlib
# clean: tclean


# =============== 2.1) Intel iPSC/2 ===============
#

imatlib: $(OBJECTS)


# ============ 2.2) Logical Systems C ============
#
#  Make everything and install in the library directory designated by the
#  environment variable TLIB.


tmatlib:
	make -f $(THISMAKEFILE) $(T414LIBNAME).tll
	make -f $(THISMAKEFILE) install4
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) $(T800LIBNAME).tll
	make -f $(THISMAKEFILE) install8
	make -f $(THISMAKEFILE) tclean
	make -f $(THISMAKEFILE) install_headers


# CREATE T414 VERSION OF THE LIBRARY


$(T414LIBNAME).tll : $(TRL4FILES)
	tlib $(T414LIBNAME) -b $(TLIB4FILES)

clargs.trl4 : clargs.h clargs.c
	pp clargs.c $(PPOPT4)
	tcx clargs.pp $(TCXOPT4)
	tasm clargs.tal $(TASMOPT)

comm.trl4 : comm.h comm.c
	pp comm.c $(PPOPT4)
	tcx comm.pp $(TCXOPT4)
	tasm comm.tal $(TASMOPT)

complex.trl4 : complex.h complex.c
	pp complex.c $(PPOPT4)
	tcx complex.pp $(TCXOPT4)
	tasm complex.tal $(TASMOPT)

generate.trl4 : generate.h generate.c matrix.h memory.trl4
	pp generate.c $(PPOPT4)
	tcx generate.pp $(TCXOPT4)
	tasm generate.tal $(TASMOPT)

hcube.trl4 : hcube.h hcube.c
	pp hcube.c $(PPOPT4)
	tcx hcube.pp $(TCXOPT4)
	tasm hcube.tal $(TASMOPT)

machine.trl4 : machine.h machine.c
	pp machine.c $(PPOPT4)
	tcx machine.pp $(TCXOPT4)
	tasm machine.tal $(TASMOPT)

mat_ops.trl4 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT4)
	tcx mat_ops.pp $(TCXOPT4)
	tasm mat_ops.tal $(TASMOPT)

math.trl4 : math.h math.c
	pp math.c $(PPOPT4)
	tcx math.pp $(TCXOPT4)
	tasm math.tal $(TASMOPT)

matrixio.trl4 : matrixio.h matrixio.c ascii.h matrix.h memory.trl4
	pp matrixio.c $(PPOPT4)
	tcx matrixio.pp $(TCXOPT4)
	tasm matrixio.tal $(TASMOPT)

memory.trl4 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT4)
	tcx memory.pp $(TCXOPT4)
	tasm memory.tal $(TASMOPT)

num_sys.trl4 : num_sys.h num_sys.c matrix.h
	pp num_sys.c $(PPOPT4)
	tcx num_sys.pp $(TCXOPT4)
	tasm num_sys.tal $(TASMOPT)

sep.trl4 : sep.h sep.c
	pp sep.c $(PPOPT4)
	tcx sep.pp $(TCXOPT4)
	tasm sep.tal $(TASMOPT)

timing.trl4 : timing.h timing.c
	pp timing.c $(PPOPT4)
	tcx timing.pp $(TCXOPT4)
	tasm timing.tal $(TASMOPT)

vec_ops.trl4 : vec_ops.h vec_ops.c
	pp vec_ops.c $(PPOPT4)
	tcx vec_ops.pp $(TCXOPT4)
	tasm vec_ops.tal $(TASMOPT)


# CREATE T800 VERSION OF THE LIBRARY


$(T800LIBNAME).tll : $(TRL8FILES)
	tlib $(T800LIBNAME) -b $(TLIB8FILES)

clargs.trl8 : clargs.h clargs.c
	pp clargs.c $(PPOPT8)
	tcx clargs.pp $(TCXOPT8)
	tasm clargs.tal $(TASMOPT)

comm.trl8 : comm.h comm.c
	pp comm.c $(PPOPT8)
	tcx comm.pp $(TCXOPT8)
	tasm comm.tal $(TASMOPT)

complex.trl8 : complex.h complex.c
	pp complex.c $(PPOPT8)
	tcx complex.pp $(TCXOPT8)
	tasm complex.tal $(TASMOPT)

generate.trl8 : generate.h generate.c matrix.h memory.trl8
	pp generate.c $(PPOPT8)
	tcx generate.pp $(TCXOPT8)
	tasm generate.tal $(TASMOPT)

hcube.trl8 : hcube.h hcube.c
	pp hcube.c $(PPOPT8)
	tcx hcube.pp $(TCXOPT8)
	tasm hcube.tal $(TASMOPT)

machine.trl8 : machine.h machine.c
	pp machine.c $(PPOPT8)
	tcx machine.pp $(TCXOPT8)
	tasm machine.tal $(TASMOPT)

mat_ops.trl8 : mat_ops.h mat_ops.c matrix.h
	pp mat_ops.c $(PPOPT8)
	tcx mat_ops.pp $(TCXOPT8)
	tasm mat_ops.tal $(TASMOPT)

math.trl8 : math.h math.c
	pp math.c $(PPOPT8)
	tcx math.pp $(TCXOPT8)
	tasm math.tal $(TASMOPT)

matrixio.trl8 : matrixio.h matrixio.c ascii.h matrix.h memory.trl8
	pp matrixio.c $(PPOPT8)
	tcx matrixio.pp $(TCXOPT8)
	tasm matrixio.tal $(TASMOPT)

memory.trl8 : memory.h memory.c matrix.h
	pp memory.c $(PPOPT8)
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246 tcx memory. pp $(TCX0PT8) 

247 tasm memory. tal S(TASMOPT) 

246 

249 num_sys.trl8 : num.sys.h num_sys.c matrix. h 

250 pp num_sys.c $(PP0PT8) 

251 tcx num_sys.pp $(TCX0PT8) 

252 tasm num_sys.tal $(TASM0PT) 

253 

254 sep.tr!8 : sep.h sep.c 

255 pp sep.c $(PP0PT8) 

256 tcx sep.pp $(TCX0PT8) 

257 tasm 8ep.tal $(TASH0PT) 

258 

259 timing. trl8 : timing. h timing. c 

260 pp timing. c $(PP0PT8) 

261 tcx timing. pp $(TCX0PT8) 

262 tasm timing. tal $(TASM0PT) 

263 

264 vec_ops.trl8 : vec_ops.c vec_ops.h 

265 pp vec_ops.c $ (PP0PT8) 

266 tcx vec_ops.pp $(TCX0PT8) 

267 tasm vec_ops.tal $(TASM0PT) 

266 

269 

# COPY LIBRARIES TO TLIB DIRECTORY

install4:
	copy $(T414LIBNAME).til $(TLIB)

install8:
	copy $(T800LIBNAME).til $(TLIB)


# COPY HEADER FILES TO STANDARD INCLUDE DIRECTORY

install_headers :
	copy ascii.h $(TLIB)\..\include
	copy macros.h $(TLIB)\..\include
	copy matrix.h $(TLIB)\..\include
	copy clargs.h $(TLIB)\..\include
	copy comm.h $(TLIB)\..\include
	copy complex.h $(TLIB)\..\include
	copy generate.h $(TLIB)\..\include
	copy hcube.h $(TLIB)\..\include
	copy machine.h $(TLIB)\..\include
	copy mat_ops.h $(TLIB)\..\include
	copy math.h $(TLIB)\..\include
	copy matrixio.h $(TLIB)\..\include
	copy memory.h $(TLIB)\..\include
	copy num_sys.h $(TLIB)\..\include
	copy sep.h $(TLIB)\..\include
	copy timing.h $(TLIB)\..\include
	copy vec_ops.h $(TLIB)\..\include


# ======== 3.) FILE MANAGEMENT & UTILITIES ========
#
# This section makes short work of a few useful/routine tasks.
#


# =============== 3.1) Intel iPSC/2 ===============
#

iclean:
	rm $(OBJECTS)


# ============ 3.2) Logical Systems C ============
#

tclean:
	del *.pp
	del *.tal
	del *.trl


# EOF matlib.mak
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B. NETWORK INFORMATION FILES 



hyprcube.nif This Network Information File gives a fairly complete description of 
the hardware configuration used to perform the transputer work. 
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hyprcube.nif 



; ======== NETWORK INFORMATION FILE ========
;
; SOURCE  : hyprcube.nif
; VERSION : 1.1
; DATE    : 09 September 1991
; AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
; USAGE   : ld-net hyprcube
; EDITING : replace 'rootcode' with the code to run on the root
;           replace 'nodecode' with appropriate code(s) for the nodes
;
;
; ============== REFERENCES ================
;
; [1] Inmos. IMS B012 User Guide and Reference Manual. Inmos Limited,
;     1988, Fig. 26, p. 28.
;
;
; ============== DESCRIPTION ===============
;
; Network Information File (NIF) used by Logical Systems C (version 89.1)
; LD-NET Network Loader. This file prescribes the loading action to take
; place when the 'ld-net' command is given as in USAGE above.
;
;
; ========= HARDWARE PREREQUISITES =========
;
; NOTE: There are three node numbering systems: the one created by Inmos'
; CHECK program, the Gray code labeling, and the NIF labeling. Since all
; three will be used on occasion, I will prefix node numbers with a C, G,
; or N to identify which system I am using!
;
; The IMS B004 and IMS B012 must be configured correctly. The B004's T414
; has link 0 connected to the host PC via a serial-to-parallel converter,
; link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
; [communications manager (not used here)] on the B012, and link 3
; connected to the IMS B012 PipeTail (see [1]). By the way, link 2 from
; the B004 goes to the ConfigUp slot just under the PipeHead slot
; (this connects it to the T212). Finally, the B004's Down link must run
; to the B012's Up link.
;
;
; ==== SETTING THE C004 CROSSBAR SWITCHES =====
;
; Once you have connected the hardware in the fashion mentioned above,
; the system is ready to be transformed to a hypercube. Three codes by
; Mike Esposito are used here: t2.nif, root.tld, and switch.tld. I have
; a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
; Mike's code passes instructions to the T212 on the B012, which in turn
; tells the C004's how to connect their switches. After the code has
; executed, the (very specific) configuration that we are looking for
; will exist. Specifically, the following (output from CHECK /R) is what
; this process gives us:
;
; check 1.21
;  # Part rate  Mb   Bt [ Link0 Link1 Link2 Link3 ]
;  0 T414b-15  0.09  0  [  HOST   1:1   2:1   3:2 ]
;  1 T800c-20  0.80  1  [   4:3   0:1   5:1   6:0 ]
;  2 T2   -17  0.49  1  [  C004   0:2   ...  C004 ]
;  3 T800c-20  0.80  2  [   7:3   8:2   0:3   9:0 ]
;  4 T800c-20  0.76  3  [   9:3  10:2  11:1   1:0 ]
;  5 T800d-20  0.90  1  [   8:3   1:2  10:1  12:0 ]
;  6 T800d-20  0.76  0  [   1:3  12:2   7:1  11:0 ]
;  7 T800d-20  0.76  3  [  13:3   6:2  14:1   3:0 ]
;  8 T800d-20  0.90  2  [  14:3  15:2   3:1   5:0 ]
;  9 T800c-20  0.77  0  [   3:3  13:2  15:1   4:0 ]
; 10 T800d-20  0.90  2  [  16:3   5:2   4:1  15:0 ]
; 11 T800d-20  0.90  1  [   6:3   4:2  16:1  13:0 ]
; 12 T800d-20  0.77  0  [   5:3  16:2   6:1  14:0 ]
; 13 T800d-20  0.77  3  [  11:3  17:2   9:1   7:0 ]
; 14 T800c-20  0.90  1  [  12:3   7:2  17:1   8:0 ]
; 15 T800c-20  0.90  2  [  10:3   9:2   8:1  17:0 ]
; 16 T800c-20  0.76  3  [  17:3  11:2  12:1  10:0 ]
; 17 T800d-20  0.88  2  [  15:3  14:2  13:1  16:0 ]
;
; Here node C0 is the root transputer (on the IMS B004) and node C2 is
; the T212 (on the IMS B012). The other sixteen nodes are the T800's
; that are used for the work. A logical interconnection topology is
; described below.
;
;
; ================ TOPOLOGY ================
;
; The physical interconnection scheme described above is an actual 4-cube
; with one exception. The root node (C0) is situated BETWEEN nodes C1
; and C3 (which would be connected directly in the usual 4-cube). This
; gives us two 3-cubes: one whose node labeling is G0xxx and the other,
; whose node labeling is G1xxx (where the xxx represents all permutations
; of 3-bits). These are the usual three cubes, and they will exist if we
; define the node numbering/labeling correctly.
;
;
; ================ STRATEGY ================
;
; The node labeling established by the NIF is available via the variable
; _node_number (see <conc.h>) in source code. Therefore, we would like a
; smart labeling scheme in the NIF file so that programming is easier.
; This, of course, is subject to the restriction that NIF labels begin
; with N1 and so on.
;
; One such method would be to define a NIF labeling so that the Gray code
; label for a node would be (_node_number - 2). In fact, this is
; possible and the adjacencies defined below allow us to realize this
; feature. Below, node N0 is the host PC, node N1 is the root transputer
; (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
; nodes of a 4-cube), and N18 is not used (but it's the T212).
;

host_server cio.exe;

; NODE  TRANSPUTER   (default)  DESCRIPTION OF LINK CONNECTIONS
; ID    LOADABLE     RESET      LINK0  LINK1  LINK2  LINK3
;       CODE (.tld)  COMES
;                    FROM:

   1,   rootcode,    r0,         0,     2,      ,    10    ; B004
   2,   nodecode,    r1,         4,     1,     3,     6    ; B012
   3,   nodecode,    r2,        11,     2,     5,     7    ;
   4,   nodecode,    r5,        12,     5,     8,     2    ;
   5,   nodecode,    r3,         9,     3,     4,    13    ;
   6,   nodecode,    r7,         2,     7,    14,     8    ;
   7,   nodecode,    r9,         3,     9,     6,    15    ;
   8,   nodecode,    r4,         6,     4,     9,    16    ;
   9,   nodecode,    r8,        17,     8,     7,     5    ;
  10,   nodecode,    r11,       14,    11,     1,    12    ;
  11,   nodecode,    r13,       15,    13,    10,     3    ;
  12,   nodecode,    r16,       10,    16,    13,     4    ;
  13,   nodecode,    r12,        5,    12,    11,    17    ;
  14,   nodecode,    r6,        16,     6,    15,    10    ;
  15,   nodecode,    r14,        7,    14,    17,    11    ;
  16,   nodecode,    r17,        8,    17,    12,    14    ;
  17,   nodecode,    r15,       13,    15,    16,     9    ;
  18,   switch,      s1,          ,     1,      ,          ; T212

; EOF hyprcube.nif
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C. STANDARD FILES 



macros.h This header file gives several C macros that are used in other programs.

matrix.h This header file establishes the standard definition of a matrix.
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macros. h 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : macros.h
 * VERSION : 1.3
 * DATE    : 14 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 */

#define MAX(x,y)  (((x) > (y)) ? (x) : (y))

#define MIN(x,y)  (((x) > (y)) ? (y) : (x))

#define POW2(n)   ((1) << (n))

/* ============== EOF macros.h ========== */
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matrix, h 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : matrix.h
 * VERSION : 2.0
 * DATE    : 02 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION ==============
 *
 * A header file for a family of functions designed to work with matrices.
 *
 */

#include "complex.h"            /* for Complex_Type */


/* ========== MANIFEST CONSTANTS ========== */

#define  BASE_TEN          10
#define  CURRENT           1
#ifndef  EXIT_FAILURE
#define  EXIT_FAILURE      1
#endif
#ifndef  EXIT_SUCCESS
#define  EXIT_SUCCESS      0
#endif
#define  FAILURE           1
#define  FALSE             0
#define  LINE_LENGTH       80
#define  MAX_NAME_LENGTH   80
#define  NO                0
#define  OFF               0
#define  ON                1
#define  ONE_BYTE          1
#define  ONE_MEMBER        1
#define  PREVIOUS          0
#define  SUCCESS           0
#define  TRUE              1
#define  TYPE_CHAR         0
#define  TYPE_DOUBLE       1
#define  TYPE_FLOAT        2
#define  TYPE_INT          3
#define  YES               1


/* ========== TYPE DEFINITIONS ========== */


typedef struct {

    char    *name;
    int     rows,
            cols;

    double  **matrix;
} Matrix_Type;                  /* default/standard is type double */



typedef struct {

    char          *name;
    int           rows,
                  cols;
    Complex_Type  **matrix;

} Complex_Matrix_Type;          /* type Complex_Type is in complex.h */



typedef struct {

    char    *name;
    int     rows,
            cols;

    double  **matrix;
} Double_Matrix_Type;



typedef struct {

    char    *name;
    int     rows,
            cols;

    float   **matrix;
} Float_Matrix_Type;



typedef struct {

    char    *name;
    int     rows,
            cols;
    int     **matrix;

} Int_Matrix_Type;


/* ============ EOF matrix.h ============ */
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D. SOURCE CODE FILES 



There is one header file and one (.c) source code file for each remaining member 
of the library, so the filename is given without the suffix. 

allocate Memory allocation and management functions. 

clargs For processing command-line arguments. 

comm Communications functions for the hypercubes. 

complex Complex numbers and operations. 

epsilon Machine precision functions. 

generate Matrix generation functions. 

io Input/output (I/O) functions.

mathx A small extension to the C math library. 

num_sys Various number systems (binary, decimal, hexadecimal). 

ops Matrix and vector operations. 

timing Functions for timing. 

Again, however, most of the source code has been omitted and only the header
files remain. The singular exception is complex.c because this source contains an
algorithm referenced earlier in the thesis.
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allocate.h 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : allocate.h
 * VERSION : 2.0
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION =============
 *
 * Declarations of functions associated with memory allocation.
 *
 *
 * ========= LIST OF FUNCTIONS ===========
 *
 *     cmatalloc()
 *     intvecalloc()
 *     matalloc()
 *
 */


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure (of the Complex_Matrix_Type) using the C function
 *              calloc(). Additionally, it fills the "rows" and "cols"
 *              fields of the matrix structure returned with the parameters
 *              passed to the function. If a structure is returned (see
 *              "RETURNS"), then its "rows" and "cols" fields will be
 *              filled with the correct values. The structure type is
 *              defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int rows    the number of rows in the desired matrix
 *              int cols    the number of columns in the desired matrix
 *
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Complex_Matrix_Type *A;
 *
 *              A = cmatalloc(7, 7);
 *
 */

#ifdef PROTOTYPE

Complex_Matrix_Type *cmatalloc(int rows, int cols);

#else

Complex_Matrix_Type *cmatalloc();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a vector,
 *              v, of num_elements integer elements.
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  See PURPOSE.
 *
 * RETURNS:     A pointer to the array if successful and NULL otherwise.
 *
 * EXAMPLE:     int desired_size_of_v = 7,
 *                  *v;
 *
 *              v = intvecalloc(desired_size_of_v);
 *
 */

#ifdef PROTOTYPE

int *intvecalloc(int num_elements);

#else

int *intvecalloc();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     This function performs the memory allocation for a matrix
 *              structure using the C function calloc(). Additionally, it
 *              fills the "rows" and "cols" fields of the matrix structure
 *              returned with the parameters passed in to the function.
 *              If a structure is returned (see "RETURNS"), then its "rows"
 *              and "cols" fields will be filled with the correct values.
 *              The structure type is defined in "matrix.h".
 *
 * INCLUDE:     "allocate.h"
 *
 * CALLS:       calloc()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int rows    the number of rows in the desired matrix
 *              int cols    the number of columns in the desired matrix
 *
 * RETURNS:     A pointer to the structure if successful; NULL otherwise.
 *              The NULL case includes non-positive rows or cols in
 *              addition to the obvious allocation failure.
 *
 * EXAMPLE:     Double_Matrix_Type *A = matalloc(7, 7);
 *
 */

#ifdef PROTOTYPE

Double_Matrix_Type *matalloc(int rows, int cols);

#else

Double_Matrix_Type *matalloc();

#endif


/* ============ EOF allocate.h ============ */
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PROGRAM INFORMATION 



clargs.h

/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : clargs.h
 * VERSION : 1.5
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION =============
 *
 * This header file gives the declarations to accompany clargs.c. These
 * files provide a standard (if somewhat limited) way of handling command-
 * line arguments. The objective is to handle:
 *
 * 1.) Simple boolean arguments like "if -v exists, set verbose = TRUE".
 *     We will call such an argument a 'simple' argument type. This
 *     type of argument can be recognized by the fact that it has no
 *     sub-arguments (the sub-argument count, subargc == 0).
 *
 * 2.) Arguments with sub-arguments to be interpreted as numbers. We
 *     will call this a 'complex' argument type. Suppose that we want to
 *     set int dim = 3 when the command line arguments contain "-d 3".
 *     This case implies several requirements:
 *
 *     a.) First, we must know in advance how many sub-arguments the
 *         argument has -- we'll call this subargc (in this case we are
 *         expecting one sub-argument, so the caller would have set
 *         subargc = 1).
 *
 *     b.) Secondly, we must know how to interpret each sub-argument
 *         [i.e., what type is the sub-argument? Is it a double or long
 *         (float and int can be handled by type casting)?]
 *
 *     We will call this kind of argument a complex argument type. They
 *     can be recognized as those with subargc > 0.
 *
 * Here is the strategy. The user makes a list of valid command-line
 * arguments by creating an array of pointers to structures of type
 * Arg_Struct. We will call this the option list, (Arg_Struct *) optv[].
 * The code assumes that you can do something like this at the top of your
 * source:
 *
 *     #define MAX_NUMBER_OF_ARGS 3
 *
 *     static Arg_Struct *optv[MAX_NUMBER_OF_ARGS];
 *
 * Let (int) optc be the option count (number of options). Every element
 * in (pointed to by) the option list is a structure of type Arg_Struct
 * defined below. By using the standard C argc and argv, and by creating
 * and passing optc and optv around, we can manipulate command-line
 * arguments just about however we want. The next step is to understand
 * the structure.
 *
 *
 * =========== LIST OF FUNCTIONS ============
 *
 *     install_complex_arg()
 *     install_simple_arg()
 *     interpret_args()
 *
 */


/* ========== MANIFEST CONSTANTS ========== */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE 1
#endif
#ifndef EXIT_SUCCESS
#define EXIT_SUCCESS 0
#endif
#ifndef FALSE
#define FALSE 0
#endif
#ifndef NULL
#define NULL 0
#endif
#ifndef SUCCESS
#define SUCCESS 0
#endif
#ifndef TRUE
#define TRUE 1
#endif


/*
 * The maximum number of characters in an argument name, MAX_ARGLEN, is a
 * relatively arbitrary thing....make it whatever you want. The DOUBLE
 * and LONG manifest constants are assumed to be used for values of
 * subargi (see the structure below).
 */

#define MAX_ARGLEN 7
#define DOUBLE     0
#define LONG       1


/* ========== DATA STRUCTURES ==========
 *
 * argname    The (string) name of a valid argument. For instance, if
 *            you want the simple argument "-v", then argname[] would be
 *            "-v". If you have a complex argument that will appear as
 *            "-number 3 4.5 6.7", then argname will be "-number" and you
 *            must use the sub-argument variables below to handle the
 *            integer and two floating-point values.
 *
 * subargc    Consider the "-number" example again. There are three sub-
 *            arguments (3, 4.5, and 6.7) so the sub-argument count would
 *            be 3.
 *
 * subargi[]  This array tells us how to interpret the sub-arguments. For
 *            instance, again using the "-number" example above, we would
 *            set subargi[0] = LONG; subargi[1] = DOUBLE; and
 *            subargi[2] = DOUBLE.
 *
 * found      This should be initialized to FALSE. The function
 *            interpret_args() will set this field TRUE if the argname[]
 *            appears on the command-line (in *argv[]).
 *
 * dsa[]      This field is an array of double sub-arguments.
 *
 * lsa[]      This field is an array of long sub-arguments.
 *
 * Consider the "-number" example again. After argument resolution, we
 * would find that dsa[0] is not defined since subargi[0] == LONG.
 * However, we can use subargi[] to verify that subargi[1] and subargi[2]
 * are DOUBLE. Knowing this, we can safely presume that the values with
 * CORRESPONDING index in dsa[] should be interpreted as doubles. That
 * is, dsa[1] will be a double value (4.5) and dsa[2] will also be a
 * double (6.7). In a similar manner, lsa[0] must be a long (3) and
 * lsa[1] and lsa[2] are not defined.
 *
 */

typedef struct {

    char    argname[MAX_ARGLEN];

    int     subargc,        /* how many subarguments expected    */
            *subargi,       /* how to interpret subarguments     */
            found;          /* set TRUE if the argument is found */

    double  *dsa;           /* double-valued sub-arguments       */
    long    *lsa;           /* long-valued sub-argument list     */

} Arg_Struct;


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To install a valid complex argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int         index;
 *              Arg_Struct  *optv[];
 *              const char  *argname;
 *              int         *interpret,
 *                          subargc;
 *
 * The first three parameters are exactly like the corresponding ones for
 * install_simple_arg(). Additionally, for complex arguments, we need to
 * pass in instructions concerning how many sub-arguments there are (i.e.,
 * subargc) and how to interpret each. The array interpret[] should be
 * filled with subargc elements when you call this function. The elements
 * should only be valid ones (e.g., DOUBLE, LONG).
 *
 */

#ifdef PROTOTYPE

void install_complex_arg(int index, Arg_Struct *optv[],
                         const char *argname, int *interpret,
                         int subargc);
#else

void install_complex_arg();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     To install a valid simple argument in the option list,
 *              optv[].
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS:  int         index;
 *              Arg_Struct  *optv[];
 *              const char  *argname;
 *
 * The 'index' gives the location of the option in the option list,
 * optv[]. The function uses this index to install the argname at the
 * proper location in optv[]. For instance, set this variable to zero for
 * the first option in the list. Normal C indexing convention applies;
 * namely, 0 <= index < MAX_NUMBER_OF_ARGS. The 'argname' is the string
 * that you want recognized as a valid argument. For instance, suppose
 * that you want a timing argument to be recognized whenever "-t" appears
 * on the command line. Then you would supply "-t" in this place.
 *
 */

#ifdef PROTOTYPE

void install_simple_arg(int index, Arg_Struct *optv[],
                        const char *argname);
#else

void install_simple_arg();

#endif


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:     Once the user has defined an appropriate option list,
 *              optv[], with optc options, this function parses the
 * command-line arguments (as given by argc and argv) and fills the
 * *optv[] structures appropriately. For instance, every valid (exists in
 * optv ==> valid) argument that appears on the command line will result
 * in the corresponding optv structure's 'found' field being set to TRUE.
 * The function also interprets sub-arguments and fills dsa[] and/or lsa[]
 * accordingly. It assumes that the caller has established the desired
 * argname's, subargc's, and subargi's.
 *
 * INCLUDE:     "clargs.h"
 *
 * CALLS:       printf()
 *              strcmp()
 *              strtod()
 *              strtol()
 *
 * CALLED BY:
 *
 * PARAMETERS:  As described above.
 *
 */

#ifdef PROTOTYPE

void interpret_args(int argc, char **argv, int optc, Arg_Struct **optv);

#else

void interpret_args();

#endif


/* ============ EOF clargs.h ============ */
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comm.h 



1 /♦ =========== PROGRAM information ========== 

2 * 

3 * SOURCE : comm.h 

4 * VERSION : 2.S 

5 * DATE : 14 September 1991 

6 * AUTHOR : Jonathan E. Hartman, U. S. laval Postgraduate School 

7 * 

8 * 

9 * ============== DESCRIPTION ============= 

io * 

n * This header file gives manifest constants and function specifications 

12 * for comm.c. These files contain communication (and related) functions 

13 * for a normal hypercube topology and a hybrid topology. Unfortunately 

14 * the code is a bit busy with tifdef's, but the purpose of these files is 

is * to make hypercubes a little more transparent . This makes the comm.h 

16 * and comm.c files a bit hard to read, but you should be able to recoup 

17 * this loss when it comes time to write a particular application. 

18 * 

19 * 

20 * =============== TOPOLOGIES ============= 

21 * 

22 * The functions specified below have been designed to work on three very 

23 * different machines. First, the Intel iPSC/2 with a normal hypercube of 

24 * order 0, 1, 2, or 3 is handled. A normal hypercube of transputers is 

25 * next on the list (also order 0, 1, 2, or 3) . Finally, there is a 

26 * hybrid topology of transputers that is handled. The normal hypercubes 

27 * need almost no introduction. We have a host or root processor/program 

28 * together with programs running on the nodes. I will use host and root 

29 * interchangeably here, although 'host* is properly associated with the 

30 * Intel machine and 'root* is the more correct/descriptive term when the 

31 * subject is transputer networks. The hybrid topology deserves a more 

32 * careful introduction. 

33 * 

34 * The hybrid topology is a network of Inmos transputers (PC host with an 

35 * IMS B004 board and a T414 linked to sixteen T800 processors on an IMS 

36 * B012 board) arranged so that the 'root 1 is situated between nodes zero 

37 * and eight of a 4-cube. This means that nodes 0 and 8 are NOT directly 

38 * connected. The functions made for this topology compensate for this 

39 * situation. Instead of trying to describe each function, I will simply 

40 * remark that the most natural way to treat this problem is (more-or- 

41 * less) as two 3-cubes attached to the root. A more careful description 

42 * of how each problem is handled may be found in the code for the parti- 

43 * cular function. 

44 * 

45 * In summary , the transputer portions of the code depend upon: (1) a very 

46 * specific hardware configuration, (2) the appropriate NIF file to 

47 * support the usual Gray code in a convenient way 

48 * 

49 * [ mynodeQ == _node_number - 2 ], 

50 * 
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51 * amd (3) a paxticular link axrangement like that cam be created by Mike 

52 * Esposito's t2.nif, root.tld, amd switch. tld. 

53 * 

54 * DETAILS: Look lor additional details in hyprcube . nil . 

 *
 *
 * ============== PREREQUISITES =============
 *
 * Before using any of the functions involving send() or receive(), the
 * host (or root) program must initialize_hypercube(). For transputer
 * applications, EACH of the NODES must initialize_hypercube() too, and
 * you need to be sure that a hypercube exists in hardware and that your
 * NIF describes a hypercube with the usual Gray code. You must define
 * the global variables {Channel *ic[], *oc[];} because the code depends
 * upon their existence. Both of these vectors must be of length
 * (cubesize+1) as described in the preface to initialize_hypercube().
 *
 * The cubesize and dimension that you use with the transputer
 * implementation determine the cube. Even though you actually have
 * sixteen T800's in the cube, the cubesize and dimension that you use
 * will determine the portion that actually gets used. Note that both the
 * usual hypercube and the hybrid 4-cube are built upon the same hardware
 * and link setup. Many of the functions declared below DEPEND upon the
 * proper call to the initialize_hypercube() function. To avoid
 * difficulty, observe the guidelines given with this function!
 * Additionally, in the transputer case, you will need to make sure that
 * you include <conc.h>.
 *
 *

 * =========== LIST of functions ============
 *
 *      coalesce()
 *      cubecast()
 *      cubecast_from()
 *      directional_exchange()
 *      directional_receive()
 *      directional_send()
 *      hamming_distance()
 *      initialize_hypercube()
 *      least_dimension()
 *      link_number()
 *      linkin()
 *      linkout()
 *      receive()
 *      send()
 *      submit()
 *
 */



/* ===== MACROS & MANIFEST CONSTANTS ===== */

#ifdef TRANSPUTER

#define myhost()            -1
#define mynode()            (_node_number - 2)  /* depends upon <conc.h> */

#else   /* iPSC/2 */

#define ALL_NODES           -1
#define ALL_PIDS            -1
#define ANY_NODE            0   /* for receive(from any node, ...)  */
#define ANY_TYPE            -1  /* first non-force-type message     */
#define ARBITRARY_TYPE      0   /* don't care                       */
#define KEEP_TIL_RELCUBE    1   /* for getcube()                    */
#define NODE_PID            0   /* arbitrary ... don't care         */
#ifndef NULL
#define NULL                0
#endif

#endif


#ifndef FALSE
#define FALSE               0
#endif

#ifndef TRUE
#define TRUE                1
#endif



/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: This function performs the first step in the opposite of
 *          the cubecast() function. That is, this one is used when
 * you want to collect information from the nodes in 'higher dimensions'
 * of the hypercube at the current node. You may want to perform some work
 * before forwarding this information down to the next lower dimension, so
 * the submit() function is given separately.
 *
 * Like the other functions in this file, coalesce() performs a somewhat
 * different task when executed in the hybrid 4-cube, so first we will
 * discuss the usual hypercubes. coalesce() is a null operation when
 * called from in the highest dimension [ if least_dimension(node) is
 * equal to dim ]. Otherwise it performs the communication to receive
 * from higher dimensions (i.e., neighbors with larger node numbers). If
 * it is called from the host/root, it attempts to receive() from node
 * zero.
 *
 * The coalesce() and submit() functions must be balanced properly across
 * the nodes. The CALLER must take the necessary steps to be sure that
 * buf is large enough to hold ((dim - least_dimension(node)) * len)
 * bytes. That is, there will be (dim - least_dimension(node)) copies of
 * the message accumulated at the calling node.
 *
 * There are several exceptions in the hybrid 4-cube topology. Since the
 * root is connected to nodes 0000 and 1000, it must make sure that buf
 * can hold 2 copies of length, len. Then you should think of nodes 0xxx
 * as one 3-cube and nodes 1xxx as another (more-or-less separate) 3-cube.
 * That is, there will be no exchanges in the 1xxx direction between them.
 * To determine the size of buf at any node, use the following formulae:
 *
 *      (3 - least_dimension(node)) * len,      Nodes 0xxx
 *      (3 - least_dimension(node - 8)) * len,  Nodes 1xxx
 *
 * CAUTIONS: If you fail to allocate enough space for buf, you may find
 *           that your program doesn't work.
 *
 *           The transputer implementation depends upon the parameter
 *           'type' being set equal to cubesize.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     least_dimension()
 *            myhost()          (macro given above)
 *            pow2()            "mathx.h"
 *            receive()
 *
 * CALLED BY:
 *
 * EXAMPLE: Suppose we are 'at' node 0 and we want to coalesce() copies
 *          of some object from all of the appropriate nodes. Let the
 * object be of size 'len' bytes. For concreteness, let the topology be a
 * hypercube of order 3 (i.e., dim == 3). We would allocate a large enough
 * buf to hold (dim * len) bytes, since least_dimension(0) == 0. That is,
 * node 0 will be receiving from all neighbors whose least_dimension() is
 * greater [in this case, that is ALL of its neighbors]; namely, 1, 2, and
 * 4. After the call, we would find the data from node 1 in the first len
 * bytes of buf; the data from 2 in the middle len bytes of buf; and the
 * data from 4 in the final len bytes of buf. The function is treated as
 * a multiple receive(), in increasing origin order, from the appropriate
 * neighbors.
 *
 * PARAMETERS:
 *
 *      int  node   the coalesce()ing (receiving) node
 *      int  dim    the dimension of the hypercube
 *      char *buf   a pointer to the beginning of the buffer where you want
 *                  the message placed.
 *      long len    the number of bytes to be received from EACH node in
 *                  the next higher dimension that will be submit()ing.
 *      long type   the type of the message (iPSC/2 applications only), or
 *                  cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void coalesce(int node, int dim, char *buf, long len, long type);

#else

void coalesce(/* int node, int dim, char *buf, long len, long type */);

#endif
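
The buffer-size rule for coalesce() can be checked with a small stand-alone sketch. It is not part of comm.h: the helper names (least_dimension_sketch, coalesce_bytes, coalesce_bytes_hybrid, coalesce_sources) are hypothetical, and least_dimension_sketch() is only assumed to agree with the documented values of least_dimension() (0 -> 0, 1 -> 1, 2 -> 2, 3 -> 2, 8 -> 4).

```c
/* Assumed behaviour of least_dimension(): the number of bits needed to
 * write the node number (0 for node 0). */
static int least_dimension_sketch(int node)
{
    int d = 0;
    while (node > 0) {
        node >>= 1;
        d++;
    }
    return d;
}

/* Bytes the caller must provide in buf before coalesce() in an ordinary
 * dim-cube: one len-byte slot for each submit()ing neighbor. */
static long coalesce_bytes(int node, int dim, long len)
{
    return (long)(dim - least_dimension_sketch(node)) * len;
}

/* The same question for a cube node of the hybrid 4-cube, where the
 * halves 0xxx and 1xxx are each treated as a separate 3-cube. */
static long coalesce_bytes_hybrid(int node, long len)
{
    int local = (node >= 8) ? node - 8 : node;
    return (long)(3 - least_dimension_sketch(local)) * len;
}

/* Fill src[] with the neighbors that node receives from, in increasing
 * order, and return how many there are (ordinary dim-cube only). */
static int coalesce_sources(int node, int dim, int src[])
{
    int n = 0;
    int k;
    for (k = least_dimension_sketch(node); k < dim; k++)
        src[n++] = node | (1 << k);
    return n;
}
```

For node 0 in a 3-cube this gives the sources 1, 2, and 4 and a buffer of 3 * len bytes, matching the EXAMPLE in the comment; nodes 4 through 7 need no buffer at all.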




/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: This function is called from the root/host and all nodes to
 *          execute a broadcast to all p nodes. The host/root sends to
 * node zero to start the process off. Let lg(n) denote log_2(n). This
 * function performs the communication in lg(p) steps. For instance, node
 * zero receives from the host in what we'll call stage zero. Then, in
 * stage 1, node 0 passes the message to node 1. In stage 2, node 0 sends
 * the message to node 2 and node 1 sends it to node 3. In stage three,
 * nodes 0, 1, 2, and 3 each send the message to nodes 4, 5, 6, and 7
 * (respectively).
 *
 * Then, in general, in stage i, the message moves into the ith dimension.
 * If you prefer, you can think of a pointer starting (after the message
 * arrives at node 0) at the rightmost bit (LSB) and indicating the
 * direction for the next transmission. The pointer moves left until it
 * reaches the MSB. This is the final stage of the cubecast().
 *
 * The hybrid 4-cube is implemented by sending the message from the root
 * to nodes 0 and 8 first. Then node 0 performs the usual cubecast for
 * the nodes that appear in the usual 3-cube. Node 8 mirrors this action,
 * filling the other three-cube with labels like 1xxx.
 *
 * In all cases, buf is filled with an initial receive() from the proper
 * node, and then it is used in retransmissions to other nodes. In any
 * event, buf holds the message after execution.
 *
 * CAUTION: The transputer implementation depends upon the parameter
 *          'type' being set equal to cubesize.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     least_dimension()
 *            MIN()             (macro from macros.h)
 *            myhost()          (macro from above)
 *            pow2()            "mathx.h"
 *            receive()
 *            send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *      int  node   the sending node
 *      int  dim    the dimension of the hypercube
 *      char *buf   a pointer to the head of the message
 *      long len    the number of bytes to be passed
 *      long type   the type of the message (iPSC/2 applications only), or
 *                  cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void cubecast(int node, int dim, char *buf, long len, long type);

#else

void cubecast(/* int node, int dim, char *buf, long len, long type */);

#endif
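
The staged broadcast described for cubecast() in the ordinary hypercube can be written down as three small functions. This is only a sketch of the schedule given in the comment, not the thesis implementation; the names cubecast_arrival_stage, cubecast_parent, and cubecast_dest are hypothetical, and the hybrid 4-cube is not covered.

```c
/* Stage in which a node first holds the message. Node 0 receives it
 * from the host in stage 0; every other node receives it in the stage
 * numbered by the position of its highest set bit. */
static int cubecast_arrival_stage(int node)
{
    int stage = 0;
    while (node > 0) {
        node >>= 1;
        stage++;
    }
    return stage;
}

/* The node that hands the message to `node` (requires node > 0):
 * the same label with its highest set bit cleared. */
static int cubecast_parent(int node)
{
    int top = 1;
    while ((top << 1) <= node)
        top <<= 1;
    return node ^ top;
}

/* Destination that `node` sends to in stage i (1 <= i <= dim), or -1
 * if the node does not yet hold the message in that stage. */
static int cubecast_dest(int node, int stage)
{
    int bit = 1 << (stage - 1);
    return (node < bit) ? (node | bit) : -1;
}
```

In stage 3 of a 3-cube, cubecast_dest() maps nodes 0, 1, 2, and 3 to nodes 4, 5, 6, and 7, as in the comment.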





/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: This function is similar to cubecast() but more general.
 *          Here we do not assume that the message starts at the host
 * or at node zero; it may start at any general source node, src. In fact,
 * it may NOT be called from the root/host (use cubecast() in that case).
 *
 * If dim is the order of the hypercube, then src goes through dim stages,
 * passing the message to its neighbors. The sequence is defined by an
 * XOR operation that starts at bit 1 of src and moves up through bit dim.
 * For instance, suppose src == 5 == 101b in the 3-cube (dim == 3). Then
 * src will first send to (101 XOR 001) == node 4, next to (101 XOR 010)
 * == node 7, and finally to (101 XOR 100) == node 1. Meanwhile, any time
 * that a non-source node gets the message, he begins the same process,
 * but only picks it up at the appropriate stage (the one after the stage
 * in which he received the message).
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     directional_receive()
 *            directional_send()
 *            free()
 *            least_dimension()
 *            malloc()
 *            pow2()            "mathx.h"
 *            receive()
 *            send()
 *            sizeof()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *      int  src    the source
 *      int  node   the number of the node calling this function
 *      int  dim    the dimension of the hypercube
 *      char *buf   a pointer to the head of the message
 *      long len    the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void cubecast_from(int src, int node, int dim, char *buf, long len);

#else

void cubecast_from();

#endif
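
The XOR sequence used by cubecast_from() amounts to running the ordinary cubecast() schedule on labels taken relative to src. The sketch below illustrates that reading; the function name cubecast_from_dest is hypothetical, and the code is an interpretation of the comment rather than the thesis source.

```c
/* Destination that `node` sends to in stage i (1 <= i <= dim) of a
 * broadcast that starts at src, or -1 if `node` does not hold the
 * message yet. A node holds the message before stage i exactly when
 * its label relative to src, (node XOR src), is below 2^(i-1). */
static int cubecast_from_dest(int src, int node, int stage)
{
    int bit = 1 << (stage - 1);
    int relative = node ^ src;
    if (relative >= bit)
        return -1;
    return node ^ bit;
}
```

With src == 5 in the 3-cube, the source sends to nodes 4, 7, and 1 in stages 1, 2, and 3. Node 4 (received in stage 1) forwards to nodes 6 and 0 in stages 2 and 3, and node 7 (received in stage 2) forwards to node 3 in stage 3.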



/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: To perform an exchange along a prescribed direction. The
 *          direction is given as an integer in {1, 2, 4, 8, ...,
 * 2^(dim-1)}. This is because the direction is really a bit mask for the
 * Gray-coded node numbers. For instance, if you perform a
 * directional_exchange() from node == 3 == 011 in the 3-cube along
 * direction == 4 == 100, this is the same as performing a coordinated
 * send() and receive() combination with node (011 XOR 100 == 111 == 7).
 * Care is taken to make sure that deadlock does not occur.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     pow2()            "mathx.h"
 *            receive()
 *            send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *      int  node       the number of the node calling this function
 *      int  dim        the dimension of the hypercube
 *      int  direction  as described above (1, 2, 4, 8, etc.)
 *      char *ibuf      a pointer to the head of the incoming message
 *      char *obuf      a pointer to the head of the outgoing message
 *      long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_exchange(int node, int dim, int direction,
                          char *ibuf, char *obuf, long len);

#else

void directional_exchange();

#endif


/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: To receive from a prescribed direction. The direction is
 *          as described in directional_exchange() above.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     pow2()            "mathx.h"
 *            receive()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *      int  node       the number of the node calling this function
 *      int  dim        the dimension of the hypercube
 *      int  direction  direction to receive from
 *      char *buf       a pointer to the head of the message
 *      long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_receive(int node, int dim, int direction,
                         char *buf, long len);

#else

void directional_receive();

#endif



/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: To send in a prescribed direction. The direction is as
 *          described in directional_exchange() above.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     pow2()            "mathx.h"
 *            send()
 *
 * CALLED BY:
 *
 * PARAMETERS:
 *
 *      int  node       the number of the node calling this function
 *      int  dim        the dimension of the hypercube
 *      int  direction  direction to send to
 *      char *buf       a pointer to the head of the message
 *      long len        the number of bytes to be passed
 *
 */

#ifdef PROTOTYPE

void directional_send(int node, int dim, int direction,
                      char *buf, long len);

#else

void directional_send();

#endif
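
Because a direction is a one-bit mask, the partner for any of the three directional functions is found with a single XOR. The sketch below shows this, together with a validity check and one possible send/receive ordering for an exchange. The names are hypothetical, and the ordering rule in particular is an assumption: the comments say only that deadlock is avoided, not how.

```c
/* The neighbor reached from `node` along `direction`. */
static int directional_partner(int node, int direction)
{
    return node ^ direction;
}

/* A direction is usable in a dim-cube when it is a single bit
 * below 2^dim, that is, one of 1, 2, 4, ..., 2^(dim-1). */
static int direction_is_valid(int dim, int direction)
{
    return direction > 0
        && direction < (1 << dim)
        && (direction & (direction - 1)) == 0;
}

/* One way to order a blocking exchange (an assumption, not taken from
 * the thesis code): the partner whose direction bit is 0 sends first
 * and then receives, while the other receives first and then sends. */
static int exchange_sends_first(int node, int direction)
{
    return (node & direction) == 0;
}
```

For node 3 and direction 4 in the 3-cube the partner is node 7; under the assumed ordering node 3 would send first and node 7 would receive first, so the two blocking calls always pair up.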





/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE:    To give the Hamming distance between i and j.
 *
 * INCLUDE:    "comm.h"
 *
 * CALLS:      sizeof()
 *
 * CALLED BY:
 *
 * PARAMETERS: int i, j   the numbers
 *
 * RETURNS:    (int) the Hamming distance(i, j). That is, the number of
 *             ones in the binary exclusive OR (i XOR j).
 *
 */

#ifdef PROTOTYPE

int hamming_distance(int i, int j);

#else

int hamming_distance(/* int i, int j */);

#endif
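
A minimal sketch of the stated definition follows: count the ones in (i XOR j). The function name hamming_distance_sketch is hypothetical, and the bit-clearing loop is one common way to count ones; the thesis version calls sizeof(), so it presumably walks over every bit position instead.

```c
/* Number of one bits in (i XOR j). Each pass of the loop clears the
 * lowest set bit of x, so the loop runs once per differing bit. */
static int hamming_distance_sketch(int i, int j)
{
    unsigned int x = (unsigned int)i ^ (unsigned int)j;
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}
```

Two nodes of a hypercube are neighbors exactly when this distance is 1, for example nodes 12 and 4, or nodes 3 and 7.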



/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: The initialize_hypercube() function creates the hypercube
 *          and performs the required setup for communications. It
 * must be completed before you expect to communicate. On the iPSC/2,
 * ONLY the host code should call this function. For transputer
 * implementations every node should call it (in addition to the root
 * node). This is prerequisite to most of the other functions in this
 * file. The basic requirements for this function are so different
 * (machine dependent) that there are two versions: one for the
 * transputers and one for the iPSC/2 machine.
 *
 * INCLUDE:   "comm.h"
 *
 * CALLS:     attachcube()      (Intel iPSC/2 C Library)
 *            calloc()
 *            free()
 *            getcube()         (Intel iPSC/2 C Library)
 *            linkin()
 *            linkout()
 *            load()            (Intel iPSC/2 C Library)
 *            malloc()
 *            printf()
 *            setpid()          (Intel iPSC/2 C Library)
 *            sizeof()
 *            strcpy()
 *
 * CALLED BY:
 *
 * PARAMETERS: In both cases, the desired dimension of the hypercube is
 *             passed in as the first argument. After this, the functions
 * are quite different.
 *
 * (1) iPSC/2
 *
 *     char *nodecode   A pointer to the filename of the nodecode is
 *                      required so that the function can load the node
 *                      program.
 *
 * (2) transputers
 *
 *     Channel *ic[(CUBESIZE + 1)]  This is the incoming channel list.
 *     You must declare it globally. Let CUBESIZE be the number of
 *     transputers in the hypercube. Then ic[] is a vector of length
 *     (CUBESIZE + 1). The indexing is such that (ic[n] == C), where
 *     n is some neighbor and C is the incoming Channel* from n. For
 *     instance, if node k finds that ic[n] == LINK1IN then node k
 *     knows to receive messages from node n via LINK1IN. The element
 *     ic[CUBESIZE] holds the channel for the root node (if any).
 *     ic[n] == NULL means that there is no connection to node n.
 *
 *     Channel *oc[(CUBESIZE + 1)] is the outgoing channel list. It
 *     is completely analogous to ic[] except that it will hold
 *     LINK0OUT, LINK1OUT, LINK2OUT, or LINK3OUT for the appropriate
 *     node index. Your only obligation is to define these lists as
 *     globals in the manner shown. The Channel pointer elements will
 *     be filled in by initialize_hypercube().
 *
 * RETURNS: The iPSC/2 version of the function returns a pointer to the
 *          name of the cube. In the transputer environment, the cube
 *          name has no meaning, so a void function suffices. For the
 *          transputer environment, the single most important task that
 *          initialize_hypercube() performs is the filling of ic[] and
 *          oc[]. These vectors are used by most of the other
 *          communications functions.
 *
 */

#ifdef TRANSPUTER

void initialize_hypercube(int dim);

#else

char *initialize_hypercube(/* int dim, char *nodecode */);

#endif
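
The indexing convention for ic[] and oc[] can be illustrated without transputer hardware by marking which of the (CUBESIZE + 1) slots a given node would have filled in. The sketch is hypothetical throughout: mark_links() is not a comm.h function, it uses plain ints in place of Channel pointers, and it assumes the ordinary hypercube with the root attached to node 0 (the hybrid 4-cube, where nodes 0 and 8 are not linked, is not modeled).

```c
/* Set linked[n] to 1 for every index n that would hold a non-NULL
 * channel at `node`, and to 0 elsewhere. linked[] must have
 * (2^dim + 1) elements; the last one stands for the root. */
static void mark_links(int node, int dim, int linked[])
{
    int cubesize = 1 << dim;
    int n;
    int b;

    for (n = 0; n <= cubesize; n++)
        linked[n] = 0;

    /* One neighbor per dimension: flip a single bit of the label. */
    for (b = 0; b < dim; b++)
        linked[node ^ (1 << b)] = 1;

    /* Assumption: only node 0 has a link to the root. */
    if (node == 0)
        linked[cubesize] = 1;
}
```

In a 3-cube, node 0 ends up with four marked slots (1, 2, 4, and the root slot 8), which uses all four transputer links; node 5 has three (1, 4, and 7).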





/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: This function, called from any node in the hypercube,
 *          returns the dimension of the smallest hypercube containing
 *          that node.
 *
 * INCLUDE:    "comm.h"
 *
 * CALLS:      pow2()           "mathx.h"
 *
 * CALLED BY:
 *
 * PARAMETERS: int node   the inquiring node
 *
 * RETURNS: For an n-cube containing P == 2^n processors, this function
 *          is designed to work for nodes numbered 0 through (P-1). If
 * the function is called from the root (host) node, there is no guarantee
 * as to the returned value. If it is called by a valid node, it will
 * return the dimension of the smallest hypercube containing that node
 * number. For instance least_dimension(0) == 0, least_dimension(1) == 1,
 * least_dimension(2) == 2, least_dimension(3) == 2, and least_dimension
 * (8) == 4.
 *
 */

#ifdef PROTOTYPE

int least_dimension(int node);

#else

int least_dimension(/* int node */);

#endif
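
One way to obtain the listed return values is to look for the smallest d with node < 2^d. The sketch below does this with a shift in place of pow2(); the name least_dimension_by_power is hypothetical, and the loop is an assumption that is merely consistent with the five documented values, for small non-negative node numbers.

```c
/* Smallest d such that node < 2^d. A d-cube holds the labels
 * 0 .. 2^d - 1, so this is the dimension of the smallest hypercube
 * that contains the node. */
static int least_dimension_by_power(int node)
{
    int d = 0;
    while ((1 << d) <= node)
        d++;
    return d;
}
```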





/* ========= FUNCTION DECLARATIONS ========
 *
 * PURPOSE: The receive() and send() functions declared below provide
 *          communication to (from) a buffer pointed to by buf. The
 * volume of material to send (receive) is indicated in bytes by the len
 * argument. The destination (origin) is given by the first argument,
 * using a valid node number. Suppose you have an n-cube established upon
 * a system with p == (2^n) node processors. Then you should refer to the
 * nodes of the hypercube by their node number, which is a Gray coded
 * value in the range [ 0, (p-1) ]. If you are at the root, of course,
 * you may not communicate with the root (at least not with these
 * functions); but if you are at one of the nodes of the hypercube, you
 * may communicate with the root by using myhost() as the origin (or
 * destination) of your message. The macro given above makes myhost()
 * available on the transputers.
 *
 * Transputers or iPSC/2? The type parameter is only used in the implied
 * sense with the iPSC/2 implementation [ it becomes type or typesel for
 * csend() or crecv() ]. For transputer implementations, type MUST BE set
 * equal to the number of nodes in the hypercube (e.g., p in the example
 * above). I have called this 'cubesize' in most of my references.
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     ChanIn()          (Logical Systems C, version 89.1)
 *            ChanOut()         (Logical Systems C, version 89.1)
 *            crecv()           (Intel iPSC/2 C Library)
 *            csend()           (Intel iPSC/2 C Library)
 *
 * CALLED BY:
 *
 * =============== CAUTION ================
 *
 * Make sure type == cubesize in the transputer case (see the note above)!
 *
 */

#ifdef PROTOTYPE

void receive(int origin, char *buf, long len, long type);

void send(int destination, char *buf, long len, long type);

#else

void receive(/* int origin, char *buf, long len, long type */);

void send(/* int destination, char *buf, long len, long type */);

#endif



/* ========= FUNCTION DECLARATION =========
 *
 * PURPOSE: This function is called from the nodes to submit a message
 *          to the next lower dimension. If it is called from the host
 * (root) it has no effect. When it is called from node zero, the
 * transmission is directed to the root/host. When called from any other
 * node, the information in buf is passed to the proper node in the next
 * lower dimension. The lower dimension must have an accepting coalesce()
 * or other receiving function [ coalesce() and submit() are meant to be
 * used in a balanced fashion, where each submit() or group of submit()'s
 * in one dimension is matched by a coalesce() in the next lower
 * dimension ].
 *
 * PREREQUISITE: initialize_hypercube()
 *
 * INCLUDE:   <conc.h>          (Logical Systems C, version 89.1)
 *            "comm.h"
 *
 * CALLS:     least_dimension()
 *            pow2()            "mathx.h"
 *            send()
 *
 * CALLED BY:
 *
 * EXCEPTIONS: Again, we have the hybrid hypercube in the transputer case
 *             (see many comments above). The general rule is changed in
 * this case since node 8 submit()s to the root and not node 0. This is
 * the only change.
 *
 * SPECIFICS: If you need to determine exactly where a submit() will go,
 *            you can figure it out in the following manner [ with the
 * obvious EXCEPTIONS (the previous paragraph) ] ....
 *
 * Suppose you are 'at' node i in an n-cube (p == 2^n processors). You
 * must submit() information to the (unique) node, j, that satisfies two
 * requirements:
 *
 *      (1) hamming_distance(i, j) == 1
 *
 *      (2) least_dimension(i) == (least_dimension(j) + 1)
 *
 * So, for instance, consider a 4-cube where i == 12. It should be fairly
 * easy to see that j will be node 4. This is because these two nodes are
 * adjacent and they are one dimension apart in the cube (i.e., node 4
 * first appears in a 3-cube and node 12 first appears in a 4-cube).
 *
 * PARAMETERS:
 *
 *      int  node   the sending node
 *      int  dim    the dimension of the hypercube
 *      char *buf   a pointer to the head of the message
 *      long len    the number of bytes to be passed
 *      long type   the type of the message (iPSC/2 applications only), or
 *                  cubesize in the transputer case.
 *
 */

#ifdef PROTOTYPE

void submit(int node, int dim, char *buf, long len, long type);

#else

void submit(/* int node, int dim, char *buf, long len, long type */);

#endif


/* ============== EOF comm.h ============== */
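
The destination of a submit() in the ordinary hypercube can be sketched as "clear the highest set bit of the node number". This reading is an assumption chosen to agree with the coalesce() example (node 0 receives from nodes 1, 2, and 4) and with the i == 12, j == 4 example above. It always satisfies requirement (1); requirement (2) holds literally for pairs such as 12 and 4, while for a node such as 4 the target, node 0, simply lies in a lower dimension. The name submit_target is hypothetical, and -1 stands in for myhost() as defined by the transputer macro.

```c
/* Node that `node` submits to in an ordinary hypercube. Node 0
 * submits to the root, represented here by -1. */
static int submit_target(int node)
{
    int top = 1;

    if (node == 0)
        return -1;

    /* Find the highest set bit of node, then clear it. */
    while ((top << 1) <= node)
        top <<= 1;
    return node ^ top;
}
```

Each target returned this way is the same node that cubecast() would have used to deliver a broadcast to `node`, so a round of submit() and coalesce() calls retraces the broadcast tree in reverse.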
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1 /* ========== PROGRAM information = = = ==== === 

 *
 * SOURCE  : complex.h
 * VERSION : 1.6
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * =============== REFERENCES ===============
 *
 * [1] Goldberg, David. ``What Every Computer Scientist Should Know About
 *     Floating-Point Arithmetic''. ACM Computing Surveys, Vol. 23,
 *     No. 1, March 1991.
 *
 *
 * ============== DESCRIPTION ===============
 *
 * This file contains the definition of Complex_Type and declarations of
 * functions that perform operations with complex numbers:
 *
 *      cadd()
 *      cdiv()
 *      cmul()
 *      csub()
 *      Im()
 *      Re()
 *
 */


/* ============ TYPE DEFINITION ============ */

typedef struct {

    double x,       /* real part      */
           y;       /* imaginary part */

} Complex_Type;


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: To add two complex numbers, z1 and z2, and place their sum
 *          in the Complex_Type '*sum'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, sum.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cadd(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum);

#else

void cadd();

#endif





/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: To divide two complex numbers, (z1 / z2), and place the
 *          result in the Complex_Type '*quotient'.
 *
 * ALGORITHM: The code uses Smith's formula (page 25 of [1]) to perform
 *            the division.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, quotient.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cdiv(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient);

#else

void cdiv();

#endif
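
Smith's formula avoids forming the squared magnitude of the divisor, which can overflow or underflow even when the quotient itself is representable. It divides through by the larger component of the divisor first. The sketch below is self-contained and illustrative only: Cx is a stand-in for Complex_Type, smith_div() returns its result by value rather than through a pointer, and dabs() replaces fabs() so that no math library is needed.

```c
/* Stand-in for Complex_Type: x is the real part, y the imaginary part. */
typedef struct {
    double x;
    double y;
} Cx;

/* Absolute value of a double without <math.h>. */
static double dabs(double v)
{
    return (v < 0.0) ? -v : v;
}

/* Smith's formula for (a / b). The ratio r is always formed with the
 * larger component of b in the denominator, so |r| <= 1. */
static Cx smith_div(Cx a, Cx b)
{
    Cx q;
    double r;
    double den;

    if (dabs(b.y) < dabs(b.x)) {
        r = b.y / b.x;
        den = b.x + b.y * r;
        q.x = (a.x + a.y * r) / den;
        q.y = (a.y - a.x * r) / den;
    } else {
        r = b.x / b.y;
        den = b.y + b.x * r;
        q.x = (a.y + a.x * r) / den;
        q.y = (a.y * r - a.x) / den;
    }
    return q;
}
```

For example, (1 + 2i) / (3 + 4i) takes the second branch and (1 + 2i) / (4 + 3i) takes the first; the exact quotients are 0.44 + 0.08i and 0.4 + 0.2i.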



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: To multiply two complex numbers, z1 and z2, and place their
 *          product in the Complex_Type '*product'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, product.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          cmul(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product);

#else

void cmul();

#endif





/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: To place the difference of two complex numbers, (z1 - z2),
 *          into the Complex_Type '*difference'.
 *
 * INCLUDE: "complex.h"
 *
 * PARAMETERS: The parameters give the two operands z1 and z2, and a
 *             pointer to the result, difference.
 *
 * EXAMPLE: Complex_Type z1, z2, z3;
 *
 *          csub(z1, z2, &z3);
 *
 */

#ifdef PROTOTYPE

void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference);

#else

void csub();

#endif





/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: To return the imaginary part of a complex number, z.
 *
 * PARAMETERS: The complex number, z, is passed into Im().
 *
 * RETURNS: The imaginary part of z as type double; that is, a real
 *          number y so that y * sqrt(-1) [or iy] is the imaginary part
 *          of z.
 *
 * EXAMPLE: y = Im(z);
 *
 */

#ifdef PROTOTYPE

double Im(Complex_Type z);

#else

double Im();

#endif




/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE: This function returns the real part of a complex number, z.
 *
 * PARAMETERS: The complex number, z, is passed into Re().
 *
 * RETURNS: The real part of z as type double.
 *
 * EXAMPLE: x = Re(z);
 *
 */

#ifdef PROTOTYPE

double Re(Complex_Type z);

#else

double Re();

#endif


/* ============ EOF complex.h ============ */
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complex. c 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : complex.c
 * VERSION : 1.6
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * DETAILS : See "complex.h".
 *
 */

#include <stdio.h>
#include <math.h>               /* fabs() is used by cdiv() */
#include "complex.h"




/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

void cadd(Complex_Type z1, Complex_Type z2, Complex_Type *sum)

#else

void cadd(z1, z2, sum)

    Complex_Type z1,
                 z2,
                 *sum;

#endif
{

    sum->x = z1.x + z2.x;
    sum->y = z1.y + z2.y;

}
/* End cadd() */


/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

void cdiv(Complex_Type z1, Complex_Type z2, Complex_Type *quotient)

#else

void cdiv(z1, z2, quotient)

Complex_Type z1,
             z2,
             *quotient;

#endif
{

    double d;

    if (fabs(z2.y) < fabs(z2.x)) {
        d = (z2.y / z2.x);
        quotient->x = ((z1.x + z1.y * d)/(z2.x + z2.y * d));
        quotient->y = ((z1.y - z1.x * d)/(z2.x + z2.y * d));
    }
    else {
        d = (z2.x / z2.y);
        quotient->x = (( z1.y + z1.x * d)/(z2.y + z2.x * d));
        quotient->y = ((-z1.x + z1.y * d)/(z2.y + z2.x * d));
    }

}
/* End cdiv() */
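The two branches of cdiv() divide through by the larger component of the divisor, so the sum of squares z2.x^2 + z2.y^2 is never formed and cannot overflow on its own. The sketch below restates that scheme in a self-contained form so it can be tried in isolation. The names Cx and cx_div_scaled are illustrative assumptions and are not part of complex.h.

```c
#include <math.h>

/* Hypothetical stand-in for Complex_Type: x = real part, y = imaginary part. */
typedef struct { double x, y; } Cx;

/* Scaled complex division, using the same two-branch scheme as cdiv(). */
Cx cx_div_scaled(Cx z1, Cx z2)
{
    Cx q;
    double d, denom;

    if (fabs(z2.y) < fabs(z2.x)) {
        d     = z2.y / z2.x;                  /* |d| < 1  */
        denom = z2.x + z2.y * d;
        q.x   = (z1.x + z1.y * d) / denom;
        q.y   = (z1.y - z1.x * d) / denom;
    } else {
        d     = z2.x / z2.y;                  /* |d| <= 1 */
        denom = z2.y + z2.x * d;
        q.x   = ( z1.y + z1.x * d) / denom;
        q.y   = (-z1.x + z1.y * d) / denom;
    }
    return q;
}
```

With z1 = z2 = 1e200 + 1e200i this returns 1 + 0i, whereas a direct evaluation of z2.x * z2.x + z2.y * z2.y would exceed the range of an IEEE 754 double.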





/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

void cmul(Complex_Type z1, Complex_Type z2, Complex_Type *product)

#else

void cmul(z1, z2, product)

Complex_Type z1,
             z2,
             *product;

#endif
{

    product->x = (z1.x * z2.x - z1.y * z2.y);
    product->y = (z1.x * z2.y + z1.y * z2.x);

}
/* End cmul() */



/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

void csub(Complex_Type z1, Complex_Type z2, Complex_Type *difference)

#else

void csub(z1, z2, difference)

Complex_Type z1,
             z2,
             *difference;

#endif
{

    difference->x = z1.x - z2.x;
    difference->y = z1.y - z2.y;

}
/* End csub() */



/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

double Im(Complex_Type z)

#else

double Im(z)

Complex_Type z;

#endif
{

    return(z.y);    /* y holds the imaginary part; compare cmul() */

}
/* End Im() */



/* ========= FUNCTION DEFINITION ========= */

#ifdef PROTOTYPE

double Re(Complex_Type z)

#else

double Re(z)

Complex_Type z;

#endif
{

    return(z.x);    /* x holds the real part; compare cmul() */

}
/* End Re() */


/* ============ EOF complex.c ============ */
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1 /♦ = = = ======= PROGRAM information = = ==== = = = = 

 *
 * SOURCE  : epsilon.h
 * VERSION : 1.7
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== REFERENCES ==============
 *
 * [1] Gragg, William B. Personal conversations, course notes, and MATLAB
 *     code, 1991.
 *
 *
 * ============== DESCRIPTION =============
 *
 * This file contains declarations of functions that determine the machine
 * precision for a particular machine. The definition of epsilon is given
 * below.
 *
 *
 * =========== LIST OF FUNCTIONS ===========
 *
 * epsd()
 * epsf()
 *
 * ==========================================
 */



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:  To find the machine precision. The machine precision, eps,
 *           is defined as the largest number which satisfies:
 *
 *                              1.0 + eps == 1.0
 *
 * This program uses the type "double" which normally means an 8-byte
 * (64-bit) floating-point number stored in the IEEE 754 double precision
 * standard representation of [ 1 sign bit ][ 11-bit exponent ][ 52-bit
 * mantissa/significand ].
 *
 * INCLUDE:  "epsilon.h"
 *
 * RETURNS:  The value of epsilon (double).
 *
 * ==========================================
 */

double epsd();








/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:  This function is identical to epsd() except that it returns
 *           type float. Note: The values returned may be identical,
 *           probably reflecting C arithmetic done in type double
 *           regardless of the ultimate type returned. Anyway, this
 *           function does everything using type float.
 *
 * INCLUDE:  "epsilon.h"
 *
 * RETURNS:  The value of epsilon (float).
 *
 * ==========================================
 */

float epsf();


/* ============ EOF epsilon.h ============ */
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generate, h 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : generate.h
 * VERSION : 1.7
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== REFERENCES ==============
 *
 * [1] Gragg, William B. Personal conversations, course notes, and MATLAB
 *     codes, 1991.
 *
 *
 * ============== DESCRIPTION =============
 *
 * Declarations of matrix and vector generation/initialization functions.
 *
 *
 * =========== LIST OF FUNCTIONS ===========
 *
 * hilbert()
 * identity()
 * initial_permutation_vector()
 * mxrand()
 * wilkinson()
 * zeros()
 *
 * ==========================================
 */


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function generates a Hilbert matrix of the specified
 *             size. The function takes care of memory allocation, so
 *             the caller does not need to do this. The definition used
 *             for a Hilbert matrix is (for rows and columns numbered from
 *             1) that the element at the (i,j) position has the value
 *             (1 / (i + j - 1)).
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the desired matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), hilbert() returns
 *             the allocated matrix filled with the values as described.
 *             A NULL return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = hilbert(5, 7);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

Double_Matrix_Type *hilbert(int rows, int cols);

#else

Double_Matrix_Type *hilbert();

#endif
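The element formula in the comment above is stated for 1-based row and column numbers. With the 0-based indices that C loops normally use, it becomes 1 / (i + j + 1). The sketch below fills a flat, row-major array that way. It is an illustration only: the layout of Double_Matrix_Type is not shown in this header, so a plain double array is assumed, and the name hilbert_flat is hypothetical.

```c
#include <stdlib.h>

/* Returns a rows-by-cols Hilbert matrix as a flat row-major array, or NULL
 * if the size is not positive or allocation fails. Indices are 0-based, so
 * the 1-based formula 1 / (i + j - 1) becomes 1 / (i + j + 1). */
double *hilbert_flat(int rows, int cols)
{
    double *h;
    int i, j;

    if (rows <= 0 || cols <= 0)
        return NULL;
    h = (double *) malloc((size_t) rows * (size_t) cols * sizeof(double));
    if (h == NULL)
        return NULL;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            h[i * cols + j] = 1.0 / (double) (i + j + 1);
    return h;
}
```

The caller owns the returned array and releases it with free().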






/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function generates an identity matrix of the specified
 *             size. The function takes care of memory allocation, so
 *             the caller does not need to do this.
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), identity()
 *             returns the allocated matrix filled with the ones on the
 *             diagonal. A NULL return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = identity(5, 7);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

Double_Matrix_Type *identity(int rows, int cols);

#else

Double_Matrix_Type *identity();

#endif



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To initialize a permutation vector, p[]. This function
 *             performs allocation for p[], assuming that it must contain
 *             n integer elements. Additionally, the function assigns
 *             values p[j] = j for all 0 <= j < n. If allocation fails, p
 *             will be NULL upon return.
 *
 * INCLUDE:    "allocate.h"
 *
 * CALLS:      intvecalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The size of the vector, n.
 *
 * RETURNS:    (A pointer to) The vector.
 *
 * ==========================================
 */

#ifdef PROTOTYPE

int *initial_permutation_vector(int n);

#else

int *initial_permutation_vector();

#endif
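The behavior described above (allocate n integers, set p[j] = j, return NULL on failure) can be sketched in a few lines. This version calls malloc() directly because intvecalloc() from allocate.h is not shown here; the name identity_permutation and the choice to return NULL for n <= 0 are assumptions of the sketch.

```c
#include <stdlib.h>

/* Allocates n ints and sets p[j] = j for 0 <= j < n.
 * Returns NULL if n is not positive or allocation fails. */
int *identity_permutation(int n)
{
    int *p;
    int j;

    if (n <= 0)
        return NULL;
    p = (int *) malloc((size_t) n * sizeof(int));
    if (p != NULL)
        for (j = 0; j < n; j++)
            p[j] = j;
    return p;
}
```

A vector initialized this way represents "no rows exchanged", which is the usual starting state before a pivoting routine begins swapping entries.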



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function generates a matrix whose elements are
 *             pseudo-random numbers (generated by lcdrand() in mathx.c).
 *
 * INCLUDE:    "allocate.h"
 *             "mathx.h"
 *             "matrix.h"
 *
 * CALLS:      lcdrand()
 *             matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), mxrand() returns
 *             the allocated matrix filled with the random values. A NULL
 *             return value flags an allocation failure.
 *
 * EXAMPLE:    Double_Matrix_Type *A = mxrand(5, 7);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

Double_Matrix_Type *mxrand(int rows, int cols);

#else

Double_Matrix_Type *mxrand();

#endif



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function generates a Wilkinson matrix of the specified
 *             size. The function takes care of memory allocation, so
 *             the caller does not need to do this. The definition used
 *             for a Wilkinson matrix is: ones along the diagonal, ones
 *             along the rightmost column, zeros in the upper right
 *             triangle, and (-1)'s in the lower left triangle.
 *
 *                 [  1              1 ]
 *                 [ -1   1          1 ]
 *                 [ -1  -1   1      1 ]
 *                 [ -1  -1  -1   1  1 ]
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), wilkinson()
 *             returns the allocated matrix filled with the values as
 *             described. On (allocation) failure, wilkinson() returns
 *             NULL.
 *
 * EXAMPLE:    Double_Matrix_Type *A = wilkinson(5, 7);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

Double_Matrix_Type *wilkinson(int rows, int cols);

#else

Double_Matrix_Type *wilkinson();

#endif
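The four-part definition above (diagonal, rightmost column, lower-left triangle, everything else) maps onto a short chain of tests on the indices. The sketch below builds a square n-by-n instance in a flat row-major array. The square shape, the flat array, and the name wilkinson_flat are assumptions made for illustration; wilkinson() itself takes separate row and column counts and returns a Double_Matrix_Type.

```c
#include <stdlib.h>

/* Returns an n-by-n matrix (flat, row-major) with ones on the diagonal and in
 * the last column, -1 below the diagonal, and 0 elsewhere; NULL on failure. */
double *wilkinson_flat(int n)
{
    double *a;
    int i, j;

    if (n <= 0)
        return NULL;
    a = (double *) malloc((size_t) n * (size_t) n * sizeof(double));
    if (a == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (i == j || j == n - 1)
                a[i * n + j] = 1.0;      /* diagonal or rightmost column */
            else if (i > j)
                a[i * n + j] = -1.0;     /* lower-left triangle          */
            else
                a[i * n + j] = 0.0;      /* rest of the upper triangle   */
        }
    }
    return a;
}
```

For n = 3 the rows are [1 0 1], [-1 1 1], and [-1 -1 1].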




/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function generates a matrix of the specified size,
 *             where all of the entries are zero.
 *
 * INCLUDE:    "allocate.h"
 *             "matrix.h"
 *
 * CALLS:      matalloc()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    On success (i.e., no allocation problems), zeros() returns
 *             the allocated matrix filled with zeros. On allocation
 *             failure, zeros() returns NULL.
 *
 * EXAMPLE:    Double_Matrix_Type *A = zeros(5, 7);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

Double_Matrix_Type *zeros(int rows, int cols);

#else

Double_Matrix_Type *zeros();

#endif


/* ============ EOF generate.h ============ */
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PROGRAM INFORMATION 



io.h

/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : io.h
 * VERSION : 2.2
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== DESCRIPTION =============
 *
 * This file contains declarations of functions for matrix and vector
 * input/output. The matrix structures such as "Double_Matrix_Type" are
 * given in "matrix.h".
 *
 * The following parameters are common enough to justify a one-time
 * explanation here (and not with each occurrence below):
 *
 *    width    the width in which to print a value
 *
 *    aft      the number of places to print after the decimal point
 *
 *
 * =========== LIST OF FUNCTIONS ===========
 *
 * answer()
 * fill_matrix()
 * fread_matrix()
 * fwrite_matrix()
 * getint()
 * get_matrix_size()
 * pause()
 * printmd()
 * printvd()
 * printvi()
 *
 * ==========================================
 */


/* ========== MANIFEST CONSTANTS ========== */






/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To get a yes or no answer from the user.
 *
 * NOTE:       This function includes the prompt "(y/n)?  " so you do not
 *             have to include this in your query. There is no space
 *             before, two spaces after, and no newline (i.e., as shown).
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      getchar()      <stdio.h>
 *
 * CALLED BY:
 *
 * PARAMETERS: void.
 *
 * RETURNS:    (int) YES or NO (as defined in matrix.h).
 *
 * ==========================================
 */

int answer();



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    A function which prompts the user for the pertinent data
 *             about a matrix and fills the structure provided with the
 *             appropriate information. That is, this function allows the
 *             user to input the values of the elements.
 *
 * PARAMETERS: A pointer to the structure containing the matrix to be
 *             filled.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CAUTION:    This function ASSUMES that the "rows" and "cols" fields
 *             have been correctly assigned by something like matalloc()
 *             [see "allocate.h"] and makes no effort to enter a value in
 *             those fields of the matrix structure.
 *
 * CALLS:      ()
 *
 * CALLED BY:
 *
 * PARAMETERS: The parameters tell the size of the matrix.
 *
 * RETURNS:    The matrix associated with A is operated on during the
 *             execution of the function, and the result is available
 *             upon return.
 *
 * EXAMPLE:    if (!fill_matrix(&A)) ....
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void fill_matrix(Double_Matrix_Type *A);

#else

void fill_matrix();

#endif



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    A function which reads data from a file and stores it in
 *             the matrix of A. This function takes care of matrix
 *             allocation for the caller.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CAUTION:    This function ASSUMES the file has been stored in the
 *             format described in "matrix.fmt".
 *
 * CALLS:      fgets()
 *             fscanf()
 *             rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS: The pointer to the matrix structure and the file pointer.
 *
 * RETURNS:    1 on success and 0 on any sort of failure.
 *
 * ==========================================
 */

#ifdef PROTOTYPE

int fread_matrix(Double_Matrix_Type **A, FILE *fp);

#else

int fread_matrix();

#endif



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    A function which writes data from A->matrix[][] to a file
 *             pointed to by fp.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * ASSUMPTION: The caller has already performed fopen() on fp for the
 *             "w" (write) mode.
 *
 * CALLS:      fprintf()
 *             rewind()
 *
 * CALLED BY:
 *
 * PARAMETERS: A is a pointer to the structure which contains the matrix,
 *             fp is a FILE pointer.
 *
 * RETURNS:    1 on success and 0 on failure.
 *
 * ==========================================
 */

#ifdef PROTOTYPE

int fwrite_matrix(Double_Matrix_Type *A, FILE *fp, int width, int aft);

#else

int fwrite_matrix();

#endif
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FUNCTION DECLARATION 



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    A function
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      fflush()
 *             scanf()
 *
 * CALLED BY:
 *
 * RETURNS:    The user's
 *
 * ==========================================
 */

int getint();



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    A function to ask the user for the size of a matrix.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      answer()
 *             fflush()
 *             scanf()
 *
 * CALLED BY:
 *
 * PARAMETERS: Pointers to the size of the matrix (m rows by n columns).
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void get_matrix_size(int *m, int *n);

#else

void get_matrix_size();

#endif
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FUNCTION DECLARATION 



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    Press a key to continue!
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      fflush()
 *             getchar()
 *             printf()
 *
 * ==========================================
 */

void pause();



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function provides a printout of the information stored
 *             in the structure A.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * PARAMETERS: A is the structure that contains the matrix to be printed.
 *             The width and aft values are described near the top of this
 *             file. The defaults are defined as manifest constants.
 *
 * EXAMPLE:    Double_Matrix_Type *A = hilbert(7, 5);
 *
 *             printmd(*A, LONG_WIDTH, LONG_AFT);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void printmd(Double_Matrix_Type A, int width, int aft);

#else

void printmd();

#endif
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FUNCTION DECLARATION 



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function prints the vector, v, of doubles.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * CALLED BY:
 *
 * PARAMETERS: v is the vector. size is the number of elements in v[].
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void printvd(double *v, int size, int width, int aft);

#else

void printvd();

#endif
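The width and aft parameters described at the top of this file correspond naturally to the field width and precision of a printf() conversion, both of which can be supplied at run time with the "%*.*f" form. The sketch below shows that mapping. Whether printvd() and printmd() are written this way is not shown in this header, so this is an assumption, and the names format_value and print_vector are hypothetical.

```c
#include <stdio.h>

/* Formats v into buf using a field of `width` characters with `aft` digits
 * after the decimal point. Returns whatever snprintf() returns. */
int format_value(char *buf, size_t n, double v, int width, int aft)
{
    return snprintf(buf, n, "%*.*f", width, aft, v);
}

/* Prints the `size` elements of v on one line with the same formatting. */
void print_vector(const double *v, int size, int width, int aft)
{
    int i;

    for (i = 0; i < size; i++)
        printf("%*.*f", width, aft, v[i]);
    printf("\n");
}
```

For example, formatting 3.14159 with width 8 and aft 3 yields "   3.142": three pad spaces followed by the five characters of the rounded value.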




/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    This function provides a printout of the integer vector v.
 *
 * INCLUDE:    <stdio.h>
 *             "io.h"
 *
 * CALLS:      printf()
 *
 * CALLED BY:
 *
 * PARAMETERS: v is a vector of size integers.
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void printvi(int *v, int size, int width);

#else

void printvi();

#endif


/* ============ EOF io.h ============ */
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mathx.h 



/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : mathx.h
 * VERSION : 1.2
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== REFERENCES ==============
 *
 * [1] Knuth, Donald E. The Art of Computer Programming, Volume 2:
 *     Seminumerical Algorithms. Addison-Wesley Publishing Company,
 *     Reading, MA, 1969, pp. 9-24.
 *
 * [2] Sedgewick, Robert. Algorithms, Second Edition. Addison-Wesley
 *     Publishing Company, Reading, MA, 1988, pp. 513-514.
 *
 *
 * ============== DESCRIPTION =============
 *
 * A small extension to the usual C <math.h>.
 *
 *
 * =========== LIST OF FUNCTIONS ===========
 *
 * lcdrand()
 * lclrand()
 * multmod()
 * pow2()
 *
 * ==========================================
 */



/* ========== MANIFEST CONSTANTS ========== */

#ifndef EXIT_FAILURE
#define EXIT_FAILURE  -1
#endif

#define START       1234567    /* starting value, Xo.  See [1] */
#define MULT       31415821    /* multiplier, a.       See [1] */
#define INCR              1    /* increment, c.        See [1] */
#define SQRTM         10000    /* sqrt(m)                      */
#define MODULUS   100000000    /* modulus, m.          See [1] */


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FUNCTION DECLARATION 



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To calculate a pseudo-random number in the range [0, 1)
 *             using the linear congruential method. This function is a
 *             very simple application of lclrand(). It merely divides
 *             the value that lclrand() returns by the modulus, and
 *             returns the resulting double value.
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:      lclrand()
 *
 * CALLED BY:  mxrand()      "generate.c"
 *
 * PARAMETERS: The parameters are identical to those for lclrand().
 *
 * RETURNS:    A pseudo-random double value in the range [ 0.0, 1.0 ).
 *             Because lclrand() returns at most (m-1), the quotient is
 *             always strictly less than 1.0.
 *
 * EXAMPLE:    double d;
 *
 *             d = lcdrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

double lcdrand(long Xn, long a, long c, long sqrtm, long m);

#else   /* iPSC/2 */

double lcdrand(/* long Xn, long a, long c, long sqrtm, long m */);

#endif






/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To calculate a pseudo-random number of type long in the
 *             range [0, (m-1)], where m is the argument for modulus. The
 *             algorithm uses the linear congruential method. This method
 *             is given in great detail in [1]. A shorter, algorithmic
 *             treatment is given in [2]. I have tested the function to
 *             be sure that it produces the ten numbers listed on page 513
 *             of [2].
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:      multmod()
 *
 * CALLED BY:  lcdrand()
 *
 * PARAMETERS: The notation comes from [1] (more-or-less). Xn is the
 *             starting value. a is the multiplier, c is the increment,
 *             sqrtm is the square root of m, which is the modulus. A
 *             negative value for any of the arguments is impossible and
 *             will invoke the defaults given among the manifest constants
 *             above. The starting value, Xn, is the exception. If you
 *             supply a nonnegative value, your value will be accepted as
 *             the starting value. Else, the starting value BEGINS at the
 *             default START and is changed each time the function is
 *             called (as long as the starting value argument, Xn, is
 *             negative). That is, Xn HAS MEMORY as long as your program
 *             is running. The other parameters are determined from
 *             call-to-call.
 *
 * RETURNS:    A pseudo-random long in the range [ 0, (m-1) ], where m is
 *             the modulus argument.
 *
 * EXAMPLE:    This example illustrates the use of the default values:
 *
 *             long l;
 *
 *             l = lclrand(START, MULT, INCR, SQRTM, MODULUS);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

long lclrand(long Xn, long a, long c, long sqrtm, long m);

#else   /* iPSC/2 */

long lclrand(/* long Xn, long a, long c, long sqrtm, long m */);

#endif
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mathx.h 






/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To calculate (a * b) mod m^2, while trying to avoid
 *             overflow. This function is adapted from Sedgewick's 'mult'
 *             function on page 513 of [2].
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:  lclrand()
 *
 * PARAMETERS: long a, b, m.
 *
 * RETURNS:    long (a * b) mod m^2.
 *
 * ==========================================
 */

#ifdef PROTOTYPE

long multmod(long a, long b, long m);

#else

long multmod(/* long a, long b, long m */);

#endif
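The overflow-avoiding product that multmod() describes can be sketched by splitting each operand into a high and a low part with respect to the square root of the modulus, so that no intermediate value grows much beyond twice the modulus. The code below follows that split-operand idea and then uses it for one step of the recurrence X' = (a * X + c) mod m. It is a sketch under assumptions, not the thesis source: the names mult_mod and lcg_next are hypothetical, both operands are assumed to lie in [0, sqrtm * sqrtm), and the stateful default handling of lclrand() is left out.

```c
/* (a * b) mod (sqrtm * sqrtm). Writing a = a1*sqrtm + a0 and b = b1*sqrtm + b0,
 * the a1*b1 term is a multiple of the modulus and drops out, which leaves only
 * small partial products to combine. */
long mult_mod(long a, long b, long sqrtm)
{
    long m  = sqrtm * sqrtm;
    long a1 = a / sqrtm, a0 = a % sqrtm;
    long b1 = b / sqrtm, b0 = b % sqrtm;

    return (((a0 * b1 + a1 * b0) % sqrtm) * sqrtm + a0 * b0) % m;
}

/* One step of the linear congruential recurrence with modulus sqrtm * sqrtm. */
long lcg_next(long x, long a, long c, long sqrtm)
{
    return (mult_mod(x, a, sqrtm) + c) % (sqrtm * sqrtm);
}
```

With sqrtm = 10000 every intermediate value stays below about 2 * 10^8, which fits in a 32-bit long. Dividing the result of lcg_next() by the modulus, as lcdrand() is described as doing, gives a double in [0, 1).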



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To calculate the value of two raised to the (n) power. This
 *             function [unlike the macro POW2() given in macros.h] will
 *             handle the case where (n == 0). This function uses left
 *             shifts to achieve the result, so if you ask for too large a
 *             value, the result is not guaranteed. The value of n is
 *             ASSUMED to be a NONNEGATIVE integer.
 *
 * INCLUDE:    "mathx.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The desired power of two, n.
 *
 * RETURNS:    The function returns the value of 2^(n).
 *
 * ==========================================
 */

#ifdef PROTOTYPE

long pow2(int n);

#else

long pow2(/* int n */);

#endif


/* ============ EOF mathx.h ============ */
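The caution in the pow2() comment about asking for too large a value can be made concrete. A left shift of 1L by n is only defined while the result fits in a long, so a defensive version checks n against the number of value bits first. The sketch below does that. The name pow2_shift and the convention of returning 0 for an out-of-range n are assumptions of this sketch and are not taken from mathx.c.

```c
#include <limits.h>

/* Returns 2^n computed with a left shift, or 0 if n is negative or the result
 * would not fit in a long (an error convention chosen for this sketch). */
long pow2_shift(int n)
{
    int value_bits = (int) (sizeof(long) * CHAR_BIT) - 1;   /* exclude sign bit */

    if (n < 0 || n >= value_bits)
        return 0L;
    return 1L << n;
}
```

Since no power of two is zero, a caller can treat a 0 result as "n was out of range".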




num_sys.h

/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : num_sys.h
 * VERSION : 1.4
 * DATE    : 09 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============== REFERENCES ==============
 *
 * [1] Goldberg, David. "What Every Computer Scientist Should Know About
 *     Floating-Point Arithmetic." ACM Computing Surveys, Vol. 23,
 *     No. 1, March, 1991, pp. 6-48.
 *
 * [2] Hayes, John P. "Computer Architecture and Organization."
 *     McGraw-Hill Book Company, New York, Second Edition, 1988, p. 196.
 *
 *
 * ============== DESCRIPTION =============
 *
 * The "num_sys" group of functions relate to number systems (e.g., binary,
 * decimal, hexadecimal).
 *
 *
 * =========== LIST OF FUNCTIONS ===========
 *
 * binrep()
 * binvec()
 * hexrep()
 * ieeerep()
 *
 * ==========================================
 */


/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To display the binary representation of a number. Given the
 *             parameters described below, binrep() prints the binary
 *             representation. For numbers of type double, type float, or
 *             type int, binrep() reverses the order of the bytes from the
 *             machine storage. This makes them more readily recognizable
 *             as [ SIGN ][ EXPONENT ][ MANTISSA ] for the floating-point
 *             types and orders the bytes in order of decreasing
 *             significance for the integers.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h. The
 *             function understands TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT,
 *             and TYPE_INT. It also needs a pointer to the_number.
 *
 * EXAMPLE:    float f;
 *
 *             binrep(TYPE_FLOAT, &f);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

void binrep(int number_type, void *the_number);

#else

void binrep();

#endif



/* ========== FUNCTION DECLARATION ==========
 *
 * PURPOSE:    To expand the bits of the input into an array of integers.
 *             The array only holds zeros and ones, with each element
 *             representing a bit of the input number.
 *
 * INCLUDE:    "num_sys.h"
 *
 * CALLS:
 *
 * CALLED BY:
 *
 * CAUTION:    This function returns the bits AS THEY ARE IN THE MACHINE!
 *             Many machines store type double, type float, and type int
 *             so that their bytes are in an order that is the reverse of
 *             what you might expect. Of course, the bits within a byte
 *             are in the expected (msb to lsb) order.
 *
 * PARAMETERS: The function needs to know what type of number you are
 *             sending in, so use the types given in matrix.h. The
 *             function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *             TYPE_INT. It also asks for a pointer to the number.
 *
 * RETURNS:    A pointer to int. The function will take care of allocation
 *             for this pointer, and it will fill the array with the bits
 *             of the number. For indexing purposes, you will probably
 *             need to know how big this vector is. Multiply the
 *             [sizeof(type you are sending in)] by 8 (bits/byte). That's
 *             how many elements will be in the returned vector of integer
 *             (bits). This pointer will be NULL if there was an
 *             allocation problem.
 *
 * EXAMPLE:
 *
 *     float f;    Assume that this takes 4 bytes * 8 bits
 *
 *     int *v;     To hold the bit vector of f (32 elements)
 *
 *     v = binvec(TYPE_FLOAT, &f);
 *
 * ==========================================
 */

#ifdef PROTOTYPE

int *binvec(int number_type, void *the_number);

#else

int *binvec();

#endif
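The bit expansion described for binvec() (bytes taken in machine order, bits within each byte from most to least significant) can be sketched without the TYPE_ codes by passing the object's size directly. The function below is an illustration under that assumption. The name bits_of and its (pointer, size) interface are hypothetical and differ from the binvec() prototype above.

```c
#include <limits.h>
#include <stdlib.h>

/* Expands the nbytes bytes at obj into an array of 0/1 ints: bytes in memory
 * order, bits within each byte from most to least significant. Returns NULL
 * if nbytes is zero or allocation fails; the caller frees the result. */
int *bits_of(const void *obj, size_t nbytes)
{
    const unsigned char *bytes = (const unsigned char *) obj;
    int *bits;
    size_t i;
    int b;

    if (nbytes == 0)
        return NULL;
    bits = (int *) malloc(nbytes * CHAR_BIT * sizeof(int));
    if (bits == NULL)
        return NULL;
    for (i = 0; i < nbytes; i++)
        for (b = 0; b < CHAR_BIT; b++)
            bits[i * CHAR_BIT + b] = (bytes[i] >> (CHAR_BIT - 1 - b)) & 1;
    return bits;
}
```

For a float f, a call such as bits_of(&f, sizeof f) returns sizeof(float) * CHAR_BIT elements, which is the 32-element vector of the example above on a machine with 4-byte floats and 8-bit bytes. On a little-endian machine the byte holding the sign bit comes last, which is the situation the CAUTION paragraph warns about.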




129 








130 


/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To display the hexadecimal representation of a number.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h. The
 *              function recognizes TYPE_CHAR, TYPE_DOUBLE, TYPE_FLOAT, and
 *              TYPE_INT. It also needs a pointer to the number.
 *
 *  EXAMPLE:    float f;
 *
 *              printf("The hexadecimal representation of %f is: ", f);
 *              hexrep(TYPE_FLOAT, &f);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void hexrep(int number_type, void *the_number);

#else

void hexrep();

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To display binary and IEEE representation of a number. This
 *              is nearly a tutorial function! It displays a binary
 *              representation of the number, and then breaks out the sign,
 *              exponent, and mantissa (or significand). Some terse
 *              translation tips are also provided.
 *
 *  INCLUDE:    "num_sys.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The function needs to know what type of number you are
 *              sending in, so use the types given in matrix.h. This
 *              function ONLY recognizes the floating-point types (i.e.,
 *              TYPE_DOUBLE and TYPE_FLOAT). It also needs a pointer to
 *              the number.
 *
 *  EXAMPLE:    float f;
 *
 *              printf("The IEEE 754 representation of %f is: ", f);
 *              ieeerep(TYPE_FLOAT, &f);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void ieeerep(int number_type, void *the_number);

#else

void ieeerep();

#endif


/* ============================ EOF num_sys.h ============================ */


ops.h

/* ========================== PROGRAM INFORMATION ==========================
 *
 *  SOURCE  : ops.h
 *  VERSION : 1.7
 *  DATE    : 09 September 1991
 *  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============================== REFERENCES ===============================
 *
 *  [1] Golub, Gene H., and Charles F. Van Loan. Matrix Computations. The
 *      Johns Hopkins University Press, Baltimore, 1989.
 *
 *
 * ============================== DESCRIPTION ==============================
 *
 *  The functions declared below perform matrix and vector operations. For
 *  the sake of brevity, I will often use simple (MatLab-style) notation in
 *  comments. For instance, x' means x transpose (i.e. a row). Do not
 *  confuse the comment shorthand with what is really happening in the
 *  code. My goal is to get function specifications across clearly and
 *  succinctly without excessive concern for implementation. Here are a
 *  few notes.
 *
 *  An operation preceded by a "." means "elementwise". For instance,
 *  x .* y means the elementwise vector multiplication of x by y. That is,
 *  the result would be some vector z like:
 *
 *      z' = [ x[1]*y[1], x[2]*y[2], ..., x[n]*y[n] ]
 *
 *  If the operation appears without the preceding ".", it means the vector
 *  operation.
 *
 *
 * =========================== LIST OF FUNCTIONS ===========================
 *
 *  cols()
 *  dot_product()
 *  matrix_product()
 *  max_element()
 *  normp()
 *  outer_product()
 *  rows()
 *  swap_cols()
 *  swap_rows()
 *  vec_init()
 *
 * ==========================================================================
 */


/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To return the number of columns in the matrix A.
 *
 *  INCLUDE:    "ops.h"
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

int cols(Double_Matrix_Type *A);

#else

int cols(/* Double_Matrix_Type *A */);

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    Computes the dot product of the input vectors x and y which
 *              is defined in [1] (page 4). The dot product of x and y is
 *              x' * y.
 *
 *  PARAMETERS: The vectors x and y should be arrays of type double, each
 *              having "size" elements.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      N/A
 *
 *  CALLED BY:  matrix_product() [see below]
 *
 *  RETURNS:    A double (scalar) value equal to the dot product x' * y.
 *
 *  EXAMPLE:    The following example would conclude with answer == 10.0.
 *
 *      double          answer;
 *
 *      static double   x[] = { 1.0, 2.0, 3.0 },
 *                      y[] = { 3.0, 2.0, 1.0 };
 *
 *      int             size = 3;
 *
 *      answer          = dot_product(x, y, size);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

double dot_product(double *x, double *y, int size);

#else

double dot_product(/* double *x, double *y, int size */);

#endif
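The header gives only the declaration of dot_product(). A minimal sketch of a body that would satisfy the specification above might look like this; the name `dot_product_sketch` is hypothetical and the loop is an assumption about how such a function is usually written, not the thesis code.

```c
/* Hypothetical sketch: accumulates x[0]*y[0] + ... + x[size-1]*y[size-1]. */
double dot_product_sketch(const double *x, const double *y, int size)
{
    double sum = 0.0;
    int    i;

    for (i = 0; i < size; i++)
        sum += x[i] * y[i];

    return sum;
}
```

With the vectors from the EXAMPLE above, the sum is 1*3 + 2*2 + 3*1.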


/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To multiply matrices A and B, placing the product in C.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      dot_product [see above]
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The parameters tell the size of the matrix.
 *
 *  RETURNS:    SUCCESS if the matrices were compatible for multiplication
 *              and C contained enough space to contain the entire result.
 *              FAILURE if A and B were incompatible or C was not big
 *              enough to hold the product. The values for SUCCESS and
 *              FAILURE are given in 'matrix.h'.
 *
 *  EXAMPLE:    Double_Matrix_Type *A,
 *                                 *B,
 *                                 *C;
 *
 *              if (matrix_product(A, B, C) == FAILURE) {
 *
 *                  printf("matrix_product(A,B,C) failed.\n");
 *                  exit(EXIT_FAILURE);
 *              }
 *              else {
 *
 *                  printf("C contains A * B.\n");
 *              }
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

int matrix_product(Double_Matrix_Type *A,
                   Double_Matrix_Type *B,
                   Double_Matrix_Type *C);

#else

int matrix_product();

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To search the elements below and to the right of A(k,k) for
 *              the element that is maximum in absolute value.
 *
 *  INCLUDE:    <math.h>        [link using -lm if necessary]
 *              "ops.h"
 *
 *  CALLS:      fabs()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: A is the matrix (structure), k is the index for a position
 *              on the main diagonal, A(k,k). The search will be conducted
 *              for the area of the matrix that lies below k and to its
 *              right:
 *
 *              (k,k) ------>
 *                |
 *                |   This is the area that will be searched
 *                |   for an element of maximum absolute value.
 *                |
 *                |   The search does NOT include row k nor
 *                v   does it include column k.
 *
 *              Parameters must also include s, the address of an integer
 *              that will contain the row number for the maximum element
 *              upon return; and t, an address of an integer to store the
 *              column number for the maximum element.
 *
 *  NOTE:       To search the WHOLE MATRIX, the parameter k should be (-1).
 *
 *              The values of k, s, and t should be interpreted as the C
 *              versions of indexes (i.e. beginning with 0).
 *
 *  RETURNS:    The function returns the maximum (in absolute value)
 *              element found in A (type double). Additionally, the index
 *              values for this element are placed in the variables pointed
 *              to by s (row) and t (col).
 *
 *  EXAMPLE:    Double_Matrix_Type *A;
 *
 *              double  u;
 *
 *              int     k,
 *                      s,
 *                      t;
 *
 *              u = max_element(A, k, &s, &t);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

double max_element(Double_Matrix_Type *A, int k, int *s, int *t);

#else

double max_element();

#endif
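The search region described above (rows and columns strictly greater than k) can be illustrated with a short sketch. Everything here is an assumption made for illustration: `Matrix_Sketch` is a hypothetical stand-in for Double_Matrix_Type (whose layout is not shown in this header), the function returns the signed element rather than its magnitude, and a small helper replaces the fabs() call that the thesis code lists under CALLS.

```c
/* Hypothetical matrix type: row pointers plus dimensions. */
typedef struct {
    int      rows;
    int      cols;
    double **a;
} Matrix_Sketch;

static double abs_value(double v)
{
    return v < 0.0 ? -v : v;
}

/* Searches A(i,j) for i > k and j > k. Passing k == -1 searches the whole
 * matrix. If the region is empty, returns 0.0 and sets *s = *t = -1
 * (a convention chosen for this sketch only). */
double max_element_sketch(const Matrix_Sketch *A, int k, int *s, int *t)
{
    double best_val = 0.0;
    double best_abs = -1.0;     /* smaller than any absolute value */
    int    i, j;

    *s = -1;
    *t = -1;

    for (i = k + 1; i < A->rows; i++) {
        for (j = k + 1; j < A->cols; j++) {
            double mag = abs_value(A->a[i][j]);
            if (mag > best_abs) {
                best_abs = mag;
                best_val = A->a[i][j];
                *s = i;
                *t = j;
            }
        }
    }
    return best_val;
}
```

In a complete pivoting scheme, the returned (s, t) pair would identify the row and column to swap into position (k+1, k+1) before the next elimination step.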



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    Computes the p-norm of the input vector x defined in [1]
 *              (page 53).
 *
 *  INCLUDE:    <math.h>
 *              "ops.h"
 *
 *  CALLS:      fabs()
 *
 *  CALLED BY:
 *
 *  PARAMETERS: x is the vector. It must contain "size" elements of type
 *              double. The p argument is the p of p-norm.
 *
 *  RETURNS:    A double (scalar) value equal to the p-norm of x.
 *
 *  EXAMPLE:
 *
 *      static double x[] = { 1.0, 2.0, 3.0 };
 *      double Euclidean_norm_of_x;
 *
 *      Euclidean_norm_of_x = normp(x, 2, 3);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

double normp(double *x, int p, int size);

#else

double normp();

#endif





/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To place the outer product of x and y in C.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      N/A
 *
 *  CALLED BY:  N/A
 *
 *  ASSUMPTION: The matrix associated with C is already allocated to the
 *              proper size.
 *
 *  PARAMETERS: Two vectors, x and y, of sizes x_size and y_size; and the
 *              matrix associated with C to accept the outer product.
 *
 *  RETURNS:    The matrix associated with C is filled with the proper
 *              values.
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void outer_product(double *x, int x_size, double *y, int y_size,
                   double **C);

#else

void outer_product();

#endif

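For reference, the outer product places x[i]*y[j] in row i, column j of C. The following is a minimal sketch under the stated ASSUMPTION that C is already allocated as x_size rows of y_size doubles; the name `outer_product_sketch` is hypothetical and the body is not taken from the thesis.

```c
/* Hypothetical sketch: C[i][j] = x[i] * y[j] for every (i, j). */
void outer_product_sketch(const double *x, int x_size,
                          const double *y, int y_size, double **C)
{
    int i, j;

    for (i = 0; i < x_size; i++)
        for (j = 0; j < y_size; j++)
            C[i][j] = x[i] * y[j];
}
```

Unlike dot_product(), which collapses two vectors into a scalar, this fills an x_size by y_size matrix, which is why the caller must supply storage.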



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To return the number of rows in the matrix A.
 *
 *  INCLUDE:    "ops.h"
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

int rows(Double_Matrix_Type *A);

#else

int rows();

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To swap columns p and q in the matrix contained within A.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      N/A
 *
 *  CALLED BY:
 *
 *  PARAMETERS: A is the structure holding the matrix. The integers p and
 *              q are the column numbers to be swapped. Indexes are
 *              numbered according to the C convention (beginning at zero).
 *
 *  RETURNS:    Upon return, the columns have been swapped in A.
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void swap_cols(Double_Matrix_Type *A, int p, int q);

#else

void swap_cols();

#endif


/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To swap rows p and q in the matrix contained within A.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:      N/A
 *
 *  CALLED BY:
 *
 *  PARAMETERS: A is the structure holding the matrix. The integers p and
 *              q are the row numbers to be swapped. Indexes are numbered
 *              according to the C convention (beginning at zero).
 *
 *  RETURNS:    Upon return, the rows have been swapped in A.
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void swap_rows(Double_Matrix_Type *A, int p, int q);

#else

void swap_rows();

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To initialize the vector v of n integers with the values
 *              1, 2, 3, ..., n.
 *
 *  INCLUDE:    "ops.h"
 *
 *  CALLS:
 *
 *  CALLED BY:
 *
 *  ASSUMPTION: The vector, v, has already been successfully allocated as
 *              an array of n integers.
 *
 *  PARAMETERS: The vector, v, to be initialized; and its size, n.
 *
 *  RETURNS:    The vector's elements are set to the new values and these
 *              values are in v[] upon return.
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void vec_init(int *v, int n);

#else

void vec_init();

#endif


/* ============================== EOF ops.h ============================== */


timing.h

/* ========================== PROGRAM INFORMATION ==========================
 *
 *  SOURCE  : timing.h
 *  VERSION : 1.2
 *  DATE    : 09 September 1991
 *  AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 *
 * ============================== REFERENCES ===============================
 *
 *  [1] Inmos. The Transputer Databook, Second Edition, 1989.
 *
 *  [2] Intel. iPSC/2 Programmer's Reference Manual.
 *
 *
 * ============================== DESCRIPTION ==============================
 *
 *  This file contains definitions of manifest constants, type definitions,
 *  and function declarations for time-related tasks on the Intel iPSC/2 or
 *  a network of Inmos transputers.
 *
 *
 * =========================== LIST OF FUNCTIONS ===========================
 *
 *  clock()
 *  delay()
 *
 * ==========================================================================
 */


/* ========================== MANIFEST CONSTANTS ========================== */

#ifdef TRANSPUTER

#define LO_PERIOD   64.0e-6     /* period of low priority clock     */
#define HI_PERIOD   1.0e-6      /* period of high priority clock    */
#define LO_FREQ     15625.0     /* frequency of low priority clock  */
#define HI_FREQ     1.0e6       /* frequency of high priority clock */

#else /* iPSC/2 */

#define M_PERIOD    1.0e-3      /* period of Intel's mclock()       */
#define M_FREQ      1.0e3       /* frequency for Intel's mclock()   */

#endif


/* =========================== TYPE DEFINITIONS ============================
 *
 *  The type 'ticks' is defined in an effort to make timing a bit more
 *  transparent across the machines listed.
 *
 * ==========================================================================
 */

#ifdef TRANSPUTER

typedef int ticks;

#else /* iPSC/2 */

typedef unsigned long ticks;

#endif



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To get the time (in ticks) from the processor's clock.
 *
 *  INCLUDE:    <conc.h>        (Logical Systems C, version 89.1)
 *              "timing.h"
 *
 *  CALLS:      Time()          (Logical Systems C, version 89.1)
 *              mclock()        (Intel iPSC/2 C)
 *
 *  CALLED BY:
 *
 *  PARAMETERS: None.
 *
 *  RETURNS:    The function samples the clock and returns ticks. More
 *              information on ticks, period, and frequency is given in the
 *              definitions above.
 *
 *  EXAMPLE:    ticks t[2];
 *
 *              t[0] = clock();
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

ticks clock(void);

#else

ticks clock(/* void */);

#endif
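Two samples from clock() only become a time interval once the difference is multiplied by the clock period (for example, 64.0e-6 seconds per tick for the transputer's low priority clock). The sketch below illustrates that conversion; the names `ticks_sketch` and `elapsed_seconds` are hypothetical, and the unsigned type is an assumption chosen so that a single counter wraparound still yields the right difference.

```c
/* Hypothetical tick type for this sketch (unsigned, as on the iPSC/2). */
typedef unsigned long ticks_sketch;

/* Converts the interval between two clock samples to seconds.
 * 'period' is the duration of one tick in seconds. Unsigned subtraction
 * gives the correct tick count even if the counter wrapped once. */
double elapsed_seconds(ticks_sketch start, ticks_sketch stop, double period)
{
    return (double)(stop - start) * period;
}
```

A caller would sample the clock before and after the code being timed and pass the matching period constant for the clock that was read.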



/* ========================= FUNCTION DECLARATION =========================
 *
 *  PURPOSE:    To force a delay of at least a given amount (in seconds) in
 *              program execution.
 *
 *  INCLUDE:    <conc.h>            (Logical Systems C, version 89.1)
 *              "timing.h"
 *
 *  CALLS:      ProcGetPriority()   (Logical Systems C, version 89.1)
 *              Time()              (Logical Systems C, version 89.1)
 *              mclock()            (Intel iPSC/2 C)
 *
 *  CALLED BY:
 *
 *  PARAMETERS: The (float) argument tells the function the minimum time
 *              (in seconds) to delay.
 *
 *  EXAMPLE:    delay(1.25);
 *
 * ==========================================================================
 */

#ifdef PROTOTYPE

void delay(float seconds);

#else

void delay(/* float seconds */);

#endif


/* ============================= EOF timing.h ============================= */

E. GAUSS FACTORIZATION CODE



The Gauss factorization code appears on the pages that follow. First, the code 
for partial pivoting is given. Since the complete pivoting case was very similar, most 
of it has been omitted to save space. The pivot election function, however, is shown 
in a fragment of gfpcnode.c, the node code for GF with Pivoting (Complete).

gfpp.mak

#
#
# PURPOSE : Makefile for Hypercube Gauss Factorization (GF) Program
# AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
# DATE    : 26 August 1991
#
#

ROOTCODE=gfpphost
NODECODE=gfppnode
HEADER=gf
NIF_FILE=gfpp


# OPTIONS AND DEFINITIONS
#
# iPSC/2 Section (MDIR == MatLib directory)

MDIR=/usr/hartman/matlib/


# Transputer Section
#
# The following section establishes options and definitions, starting
# with PP, the Logical Systems C Preprocessor. The '-dX' option (with no
# macro_expression) is like '#define X 1'. Next the compilation options
# for Logical Systems' TCX Transputer C Compiler are given. The '-c'
# means compress the output file. The options beginning with '-p' tell
# TCX to generate code for the appropriate processor:
#
#       -p2     T212 or T222
#       -p25    T225
#       -p4     T414
#       -p45    T400 or T425
#       -p8     T800
#       -p85    T801 or T805
#
# Logical Systems' TASM Transputer Assembler is next. The '-c' means
# compress the output file (it can cut it in half)! The '-t' is used
# because the input to TASM will be from a language translator (TCX's
# output) and not from assembly source code.
#
# The final list tells TLNK which libraries to look at during linking.
# It also establishes an entry point. We use '_main' for the root node
# and '_ns_main' for other nodes.

PPOPT2=-dPROTOTYPE -dTRANSPUTER -dT212
PPOPT4=-dPROTOTYPE -dTRANSPUTER -dT414
PPOPT8=-dPROTOTYPE -dTRANSPUTER -dT800
TCXOPT2=-cp2
TCXOPT4=-cp4
TCXOPT8=-cp8
TASMOPT=-ct
T2LIB=t2lib.tll
T4LIB=matlib4.tll t4lib.tll
T8LIB=matlib8.tll t8lib.tll
RENTRY=_main
NENTRY=_ns_main





# DEFAULT ===> MAKE ALL
#
# Comment out one or the other....
#
# all: ipsc
# run: irun
# clean: iclean
all: transputer
run: trun
clean: tclean


# ROOT CODE
#
# iPSC/2 Section

ipsc: $(ROOTCODE) $(NODECODE)

$(ROOTCODE): $(ROOTCODE).o
	cc $(ROOTCODE).o $(MDIR)allocate.o $(MDIR)clargs.o $(MDIR)commhost.o $(MDIR)generate.o $(MDIR)epsilon.o $(MDIR)io.o $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -lm -host -o $(ROOTCODE)

$(ROOTCODE).o: $(ROOTCODE).c $(HEADER).h


# Transputer Section

transputer: $(ROOTCODE).tld $(NODECODE).tld

$(ROOTCODE).tld: $(ROOTCODE).trl
	echo FLAG c > $(ROOTCODE).lnk
	echo LIST $(ROOTCODE).map >> $(ROOTCODE).lnk
	echo INPUT $(ROOTCODE).trl >> $(ROOTCODE).lnk
	echo ENTRY $(RENTRY) >> $(ROOTCODE).lnk
	echo LIBRARY $(T4LIB) >> $(ROOTCODE).lnk
	tlnk $(ROOTCODE).lnk

$(ROOTCODE).trl: $(ROOTCODE).tal
	tasm $(ROOTCODE).tal $(TASMOPT)

$(ROOTCODE).tal: $(ROOTCODE).pp
	tcx $(ROOTCODE).pp $(TCXOPT4)

$(ROOTCODE).pp: $(ROOTCODE).c
	pp $(ROOTCODE).c $(PPOPT4)


# NODE CODE
#

# iPSC/2 Section

$(NODECODE): $(NODECODE).o
	cc $(NODECODE).o $(MDIR)allocate.o $(MDIR)commnode.o $(MDIR)generate.o $(MDIR)io.o $(MDIR)mathx.o $(MDIR)ops.o $(MDIR)timing.o -node -lm -o $(NODECODE)

$(NODECODE).o: $(NODECODE).c $(HEADER).h


# Transputer Section

$(NODECODE).tld: $(NODECODE).trl
	echo FLAG c > $(NODECODE).lnk
	echo LIST $(NODECODE).map >> $(NODECODE).lnk
	echo INPUT $(NODECODE).trl >> $(NODECODE).lnk
	echo ENTRY $(NENTRY) >> $(NODECODE).lnk
	echo LIBRARY $(T8LIB) >> $(NODECODE).lnk
	tlnk $(NODECODE).lnk

$(NODECODE).trl: $(NODECODE).tal
	tasm $(NODECODE).tal $(TASMOPT)

$(NODECODE).tal: $(NODECODE).pp
	tcx $(NODECODE).pp $(TCXOPT8)

$(NODECODE).pp: $(NODECODE).c
	pp $(NODECODE).c $(PPOPT8)



# EXECUTION
#

irun: $(ROOTCODE) $(NODECODE)
	$(ROOTCODE)

trun: $(ROOTCODE).tld $(NODECODE).tld $(NIF_FILE).nif
	echo makecube first
	ld-net $(NIF_FILE) -t -v


# CLEAN UP
#

iclean:
	rm $(NODECODE).o
	rm $(ROOTCODE).o
	rm $(NODECODE)
	rm $(ROOTCODE)

tclean:
	del $(ROOTCODE).lnk
	del $(NODECODE).lnk
	del $(ROOTCODE).map
	del $(NODECODE).map
	del $(ROOTCODE).tal
	del $(NODECODE).tal
	del $(ROOTCODE).pp
	del $(NODECODE).pp
	del $(ROOTCODE).trl
	del $(NODECODE).trl


# EOF gfpp.mak

;

; SOURCE  : gfpp.nif
; VERSION : 1.0
; DATE    : 14 September 1991
; AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
; USAGE   : ld-net gfpp
;
;
; ============================== REFERENCES ===============================
;
; [1] Inmos. IMS B012 User Guide and Reference Manual. Inmos Limited,
;     1988, Fig. 26, p. 28.
;
;
; ============================== DESCRIPTION ==============================
;
; Network Information File (NIF) used by Logical Systems C (version 89.1)
; LD-NET Network Loader. This file prescribes the loading action to take
; place when the 'ld-net' command is given as in USAGE above.
;
;
; ======================== HARDWARE PREREQUISITES =========================
;
; NOTE: There are three node numbering systems: the one created by Inmos'
; CHECK program, the Gray code labeling, and the NIF labeling. Since all
; three will be used on occasion, I will prefix node numbers with a C, G,
; or N to identify which system I am using!
;
; The IMS B004 and IMS B012 must be configured correctly. The B004's T414
; has link 0 connected to the host PC via a serial-to-parallel converter,
; link 1 connected to the IMS B012 PipeHead, link 2 connected to the T212
; [communications manager (not used here)] on the B012, and link 3
; connected to the IMS B012 PipeTail (see [1]). By the way, link 2 from
; the B004 goes to the ConfigUp slot just under the PipeHead slot
; (this connects it to the T212). Finally, the B004's Down link must run
; to the B012's Up link.
;
;
; ================== SETTING THE C004 CROSSBAR SWITCHES ===================
;
; Once you have connected the hardware in the fashion mentioned above,
; the system is ready to be transformed to a hypercube. Three codes by
; Mike Esposito are used here: t2.nif, root.tld, and switch.tld. I have
; a batch file called 'makecube.bat' that performs a 'ld-net t2' also.
;
; Mike's code passes instructions to the T212 on the B012; which, in turn,
; tells the C004's how to connect their switches. After the code has
; executed, the (very specific) configuration that we are looking for
; will exist. Specifically, the following (output from CHECK /R) is what
; this process gives us:

;
; check 1.21
;   #  Part      rate  Mb  Bt  [  Link0  Link1  Link2  Link3  ]
;   0  T414b-15  0.09       0  [  HOST   1:1    2:1    3:2    ]
;   1  T800c-20  0.80       1  [  4:3    0:1    5:1    6:0    ]
;   2  T2   -17  0.49       1  [  C004   0:2    ...    C004   ]
;   3  T800c-20  0.80       2  [  7:3    8:2    0:3    9:0    ]
;   4  T800c-20  0.76       3  [  9:3    10:2   11:1   1:0    ]
;   5  T800d-20  0.90       1  [  8:3    1:2    10:1   12:0   ]
;   6  T800d-20  0.76       0  [  1:3    12:2   7:1    11:0   ]
;   7  T800d-20  0.76       3  [  13:3   6:2    14:1   3:0    ]
;   8  T800d-20  0.90       2  [  14:3   15:2   3:1    5:0    ]
;   9  T800c-20  0.77       0  [  3:3    13:2   15:1   4:0    ]
;  10  T800?-20  0.90       2  [  16:3   5:2    4:1    15:0   ]
;  11  T800d-20  0.90       1  [  6:3    4:2    16:1   13:0   ]
;  12  T800?-20  0.77       0  [  5:3    16:2   6:1    14:0   ]
;  13  T800d-20  0.77       3  [  11:3   17:2   9:1    7:0    ]
;  14  T800c-20  0.90       1  [  12:3   7:2    17:1   8:0    ]
;  15  T800c-20  0.90       2  [  10:3   9:2    8:1    17:0   ]
;  16  T800c-20  0.76       3  [  17:3   11:2   12:1   10:0   ]
;  17  T800d-20  0.88       2  [  15:3   14:2   13:1   16:0   ]



;
; Here node C0 is the root transputer (on the IMS B004) and node C2 is
; the T212 (on the IMS B012). The other sixteen nodes are the T800's
; that are used for the work. A logical interconnection topology is
; described below.



;
;
; =============================== TOPOLOGY ================================
;
; The physical interconnection scheme described above is an actual 4-cube
; with one exception. The root node (C0) is situated BETWEEN nodes C1
; and C3 (which would be connected directly in the usual 4-cube). This
; gives us two 3-cubes: one whose node labeling is G0xxx and the other,
; whose node labeling is G1xxx (where the xxx represents all permutations
; of 3 bits). These are the usual 3-cubes, and they will exist if we
; define the node numbering/labeling correctly.
;
;
; =============================== STRATEGY ================================
;
; The node labeling established by the NIF is available via the variable
; _node_number (see <conc.h>) in source code. Therefore, we would like a
; smart labeling scheme in the NIF file so that programming is easier.
; This, of course, is subject to the restriction that NIF labels begin
; with N1 and so on.
;
; One such method would be to define a NIF labeling so that the Gray code
; label for a node would be (_node_number - 2). In fact, this is
; possible and the adjacencies defined below allow us to realize this
; feature. Below, node N0 is the host PC, node N1 is the root transputer
; (T414 on the B004), N2 through N17 correspond to G0 through G15 (the
; nodes of a 4-cube), and N18 is not used (but it's the T212).
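
The practical benefit of the labeling rule (Gray label = _node_number - 2) is that a node can compute its logical hypercube neighbors with a single bit flip. The sketch below illustrates this under stated assumptions: the function and macro names are hypothetical, the node number is passed as a parameter rather than read from <conc.h>, and the special routing of the one cube edge that passes through the root transputer is not modeled.

```c
#define CUBE_DIM    4   /* dimension of the hypercube (16 worker nodes)  */
#define NIF_OFFSET  2   /* NIF label N2 corresponds to Gray label G0     */

/* Hypothetical helpers: convert between NIF labels and Gray labels. */
int gray_label(int node_number) { return node_number - NIF_OFFSET; }
int nif_label(int gray)         { return gray + NIF_OFFSET; }

/* Returns the NIF label of the neighbor reached by crossing dimension
 * 'dim' (0 .. CUBE_DIM-1). Neighbors in a hypercube differ in exactly
 * one bit of their Gray labels, so the neighbor is found with XOR. */
int cube_neighbor(int node_number, int dim)
{
    return nif_label(gray_label(node_number) ^ (1 << dim));
}
```

For example, N2 (G0) has logical neighbors G1, G2, G4, and G8, which are N3, N4, N6, and N10.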



host_server cio.exe;    (default)

; NODE   TRANSPUTER    RESET    DESCRIPTION OF LINK CONNECTIONS
;  ID    LOADABLE      COMES    LINK0   LINK1   LINK2   LINK3
;        CODE (.tld)   FROM:

   1,    gfpphost,     r0,       0,      2,       ,     10;     B004
   2,    gfppnode,     r1,       4,      1,      3,      6;     B012
   3,    gfppnode,     r2,      11,      2,      5,      7;
   4,    gfppnode,     r5,      12,      5,      8,      2;
   5,    gfppnode,     r3,       9,      3,      4,     13;
   6,    gfppnode,     r7,       2,      7,     14,      8;
   7,    gfppnode,     r9,       3,      9,      6,     15;
   8,    gfppnode,     r4,       6,      4,      9,     16;
   9,    gfppnode,     r8,      17,      8,      7,      5;
  10,    gfppnode,     r11,     14,     11,      1,     12;
  11,    gfppnode,     r13,     15,     13,     10,      3;
  12,    gfppnode,     r16,     10,     16,     13,      4;
  13,    gfppnode,     r12,      5,     12,     11,     17;
  14,    gfppnode,     r6,      16,      6,     15,     10;
  15,    gfppnode,     r14,      7,     14,     17,     11;
  16,    gfppnode,     r17,      8,     17,     12,     14;
  17,    gfppnode,     r15,     13,     15,     16,      9;
  18,    switch,       s1,        ,      1,       ,       ;     T212

; EOF gfpp.nif


/* ==========================================================================

2 * 

3 * SOURCE gf.h 

4 * VERSION : 2.5 

5 * DATE : 21 September 1991 

6 * AUTHOR : Jonathan E. Hartman, U. S. Naval Postgraduate School 

7 * 

6 * SEE ALSO: gfpc.mak makefile for the complete pivoting case 

9 * gfpp.mak makefile for the partial pivoting case 

10 * gfpchost.c host code for the complete pivoting case 

11 * gfpphost.c host code for the partial pivoting case 

12 * gfpcnode.c node code for the complete pivoting case 

13 * gfppnode.c node code for the partial pivoting case 

14 * 

15 * 

16 ♦ ========== = = = = REFERENCES ============== 

17 * 



 * [1] Gragg, William B. MATLAB code and personal conversations, 1991.
 *
 *
 * ============== DESCRIPTION =============
 *
 * This header file is shared by several programs (listed above). Each of
 * these codes has something to do with a parallel implementation of Gauss
 * Factorization (GF). Several pivoting strategies are supported. Files
 * like gfpc*.* represent a COMPLETE pivoting strategy, and the files like
 * gfpp*.* give the corresponding code for the PARTIAL pivoting scheme.
 *
 * The basic algorithm is from [1]. Parallelism is sought by distributing
 * the columns of A across the nodes of a multiprocessor system (using the
 * hypercube interconnection topology). The program is designed for the
 * Intel iPSC/2 or a network of Inmos transputers.
 *
 * The algorithm factors Q'AP = LU with P and Q permutation matrices, L
 * unit lower trapezoidal (r columns) and U upper trapezoidal with nonzero
 * diagonal elements (r rows). The program is designed for a general
 * matrix, A. It does not assume A square or sparse. There is no effort
 * to optimize for this, or any other, special structure. There is one
 * caveat: I designed the code to gather data for square matrices of full
 * rank. Therefore, I have tested the square case of random matrices very
 * carefully. While the code should work for any general matrix, it has
 * not been carefully tested in other cases. Additionally, since I sought
 * timing data for matrices of full rank, I have NOT addressed the problem
 * of gathering columns (back to the host) to the right of the final pivot
 * for rank-deficient matrices. This would not be a difficult task, but I
 * did not make this effort since it has no bearing on my goal.
 *
 * In the partial pivoting code, the search for pivots is carried out only
 * in the pivot column, so P is the identity (i.e., there are no column
 * interchanges). Many of the remaining comments pertain to the complete

 * pivoting case, since it is the most challenging. The changes for the
 * partial pivoting case should be evident in most cases. At times, when
 * the changes are not necessarily evident, clarifying remarks address the
 * partial pivoting scheme. This header file contains the majority of the
 * background and algorithm information, but if you're after a careful
 * study of the differences, compare the source codes. The algorithm below
 * gives a road map through the code.



 * =========== ALGORITHM: BACKGROUND ==========

 * 1.) Preliminaries. Consider A (m x n), a matrix of real numbers. The
 * permutation vectors, p and q, characterize column and row permutations
 * (respectively). The scalar, (g/a), is the growth factor. The integer,
 * r, is a fairly reasonable determination of the 'numerical rank' of A.
 * The C language convention is followed, numbering rows and columns from
 * zero; and storing dynamic, two-dimensional arrays (matrices) in row-
 * major-order. The 'pivot' will be that element located at A(k,k). The
 * area (in A) below and to the right of the pivot [all A(i,j) where i > k
 * and j > k] is called the 'Gauss transform area'.

 * 2.) Communications and Coordination. Let N be the number of processors
 * (workers) in the hypercube. These nodes are labeled with a Gray code
 * { 0 .. (N - 1) }. The root (host) node distributes the columns of A to
 * the nodes. This is done cyclically, using the C modulus operator ('%').
 * That is, column j will be sent to processor (j mod N). Once the nodes
 * have their columns, they begin work. Communication (for the complete
 * pivoting case) involves an election process for the next pivot, where
 * each of the nodes finds its best candidate and then the election finds
 * the best candidate in the global picture. This is done in lg(N) steps
 * using the cubecast_from() function.

 * The partial pivoting case does not require the election process that
 * complete pivoting needs, but both methods look similar (in terms of
 * communication) after the elections are complete. The node holding the
 * pivot column must perform the pivot column arithmetic and distribute
 * the resulting pivot column (also in lg(N) steps) to the other nodes.
 * Communications functions are not explained much in this code, but
 * details can be found in the files comm.h & comm.c.

 * 3.) Pivoting Strategy. The complete pivoting strategy's election
 * process (at each stage), determines the element in (the entire Gauss
 * transform area of) A that is largest in absolute value. This element
 * wins the election and is 'moved' to A(k,k) for the upcoming stage. It
 * isn't really moved... but p and q are updated so that we can keep track
 * of permutations. During the search for the new pivot, candidates are
 * denoted A(s,t) = u. The largest of the candidates is installed as the
 * next pivot. There seems to be too much overhead associated with this
 * fancy indexing off of p[] and q[]. For the partial pivoting code, I
 * chose to ACTUALLY SWAP rows (if necessary) at each stage. This makes
 * the 'pp' code a bit easier to read.

 *
 * 4.) Stopping. The GF process is repeated until one of two criteria is
 * satisfied. First, of course, we may run out of matrix. Secondly, we
 * may find a pivot whose absolute value is less than our tolerance (tol).
 * In the latter case, we have a rank-deficient A. Currently, the codes
 * recognize rank-deficiency and bail out of the iteration loop; but they
 * do not gather (to the host) all of the remaining columns to the right
 * of the last pivot. This is discussed above.

 *
 *
 * ======== ALGORITHM: THE GF PROCESS =======
 *
 * 0.) Initialization. Let dim be the dimension of the hypercube. Let
 * k = 0. Search A and find the largest (in absolute value) element, u.
 * This is done at each node. Once each node has a local candidate for
 * the next pivot, an election is held, dimension-by-dimension. This
 * requires (dim) steps, and when it is finished, every processor knows
 * exactly the position and value of the next pivot. Exception: In the
 * partial pivoting code, the processor which has the pivot column simply
 * searches the (proper part of the) pivot column for the next pivot and
 * then informs the other processors.

 *
 * 1.) Status. Every node knows the position and value of the next pivot,
 * namely u = A(s,t); and where it should be installed, A(k,k). The growth
 * rate is adjusted: g = max[g, abs(u)]. If (u < tol), then A is rank-
 * deficient and we exit the loop (using the C 'break' statement).

 *
 * 2.) Permutations. We account for the interchange of rows s and k and
 * columns t and k by swapping the elements of p[] that are indexed by k
 * and t and swapping the elements in q[] indexed by k and s. This
 * (effectively) establishes the new pivot at A(k,k). The column permu-
 * tation vector has no significance in the partial pivoting case since
 * it would never be changed. The matrix, P, in this case, is simply the
 * identity.

 *
 * 3.) Adjust the Gauss Transform Area.
 *
 *     (a) In the (single) node that holds the new pivot's column (k),
 *     divide every element below the pivot by the pivot value. Broadcast
 *     this column to every other node. Node 0 updates the manager, who
 *     uses this information to append to his copy of the resulting
 *     (factored) A.
 *
 *     (b) Now every worker has the updated column k. At every node, do
 *     the following: For every element A(i,j) [where i > k and j > k]
 *     let A(i,j) = A(i,j) - (A(i,k) * A(k,j)).
 *



 * 4.) Pivot Search. In the Gauss transform area, G, search for the
 * element that is largest in absolute value. Its position is A(s,t) and
 * its value is u. The candidates are chosen at the local (processor)
 * level, then an election is held at the global level to determine the
 * best candidate in the same manner that was described in step 0.
 * Increment k. Repeat the process (go back to step 1). The obvious
 * exceptions apply to the partial pivoting case.



 * =========== NOTES FOR IMPROVEMENT ==========

 * Currently the code does not give full support for rank-deficiency. It
 * DOES break out of the loop, but everything to the right of the final
 * pivot column will be garbage. It would be relatively easy to add the
 * necessary post-iteration rank-deficiency check and coalesce each of the
 * remaining columns back to the manager, but this code was created to
 * test the full-rank cases and take performance data.
 *
 * Secondly, there is the issue of whether it is better for the manager to
 * receive each pivot column as it becomes available, or if all columns
 * should be sent in at the end. I'm not yet sure which method is better,
 * but the current code keeps the root node up-to-date at each stage. This
 * is probably the best solution to the problem above and would probably
 * enhance performance during the iterations! It REALLY SHOULD BE TESTED!
 *
 * There are many other questions that pertain to optimization that remain
 * unanswered (especially in the complete pivoting case).



 * =========== ALGORITHM: CONCLUSION ==========

 * 1.) Rank. Set r, the rank of A, equal to the number of iterations that
 * were executed. This is automatic in the manager (host) code since
 * the integer, r, is used as the loop index. The worker nodes use k for
 * a loop index variable.

 * 2.) Interchanges. Row and column interchanges are not actually done in
 * the complete pivoting code. Instead, we maintain permutation vectors,
 * p[] and q[]. You may note that while both vectors are used heavily
 * during the GF process, q[], in particular, comes in handy at the end to
 * set A in order. The partial pivoting code performs the actual inter-
 * changes of rows. At first, we would be inclined to believe that the
 * indexing by p[] and q[] leads to better performance, but there is no
 * clear timing evidence (at this point) that supports this idea.
 *
 * 3.) Factors. The upper trapezoidal matrix, U, is the upper trapezoid
 * of (the resulting, factored) A (the diagonal of A and everything above
 * that). The lower trapezoidal matrix, L, is formed by placing ones on
 * the diagonal of A; zeros above; and copying the lower trapezoid of A
 * (excluding the diagonal). To form Q'AP, we use THE ORIGINAL copy of A
 * (not the factored, resulting A) and the matrices Q and P that are
 * implied by q[] and p[]. That is, in the end, we set Q[q[i]][i] = 1.0
 * for all i in { 0, 1, ..., (m-1) } and set P[p[j]][j] = 1.0 for all j
 * in { 0, 1, ..., (n-1) }.
 */
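The GF process described in the header comment can be sketched sequentially, which may help when reading the parallel listings. The block below is a minimal single-processor illustration, not the thesis code: the function name `gf_complete`, the flat row-major storage, and the rule that logical entry (i,j) lives at `a[q[i]*n + p[j]]` are assumptions made for this example. It follows the steps in the comment: search the remaining area for the element of largest absolute value, stop if it is below `tol`, record the interchange in `p[]` and `q[]` without moving data, divide the pivot column by the pivot, and update the Gauss transform area.

```c
#include <assert.h>

/* Absolute value helper (avoids a dependency on the math library). */
static double absval(double x) { return (x < 0.0) ? -x : x; }

/* Factor the m x n row-major matrix a[] in place (illustrative sketch).
 * Rows and columns are never moved: logical entry (i,j) is stored at
 * a[q[i]*n + p[j]].  Returns r, the number of stages completed. */
int gf_complete(double *a, int m, int n, int *p, int *q, double tol)
{
    int i, j, k, s, t, tmp;
    int kmax = (m < n) ? m : n;

    assert(a && p && q && m > 0 && n > 0);

    for (i = 0; i < m; i++) q[i] = i;      /* identity row permutation    */
    for (j = 0; j < n; j++) p[j] = j;      /* identity column permutation */

    for (k = 0; k < kmax; k++) {
        double u = 0.0;                    /* best candidate so far */
        s = k;
        t = k;

        /* Pivot search over the part of A not yet eliminated. */
        for (i = k; i < m; i++) {
            for (j = k; j < n; j++) {
                if (absval(a[q[i] * n + p[j]]) > absval(u)) {
                    u = a[q[i] * n + p[j]];
                    s = i;
                    t = j;
                }
            }
        }
        if (absval(u) < tol) break;        /* rank-deficient: stop early */

        /* Permutations: install the pivot at logical position (k,k). */
        tmp = q[k]; q[k] = q[s]; q[s] = tmp;
        tmp = p[k]; p[k] = p[t]; p[t] = tmp;

        /* Pivot column arithmetic: divide everything below the pivot. */
        for (i = k + 1; i < m; i++) a[q[i] * n + p[k]] /= u;

        /* Gauss transform: A(i,j) = A(i,j) - A(i,k) * A(k,j). */
        for (i = k + 1; i < m; i++) {
            for (j = k + 1; j < n; j++) {
                a[q[i] * n + p[j]] -= a[q[i] * n + p[k]] * a[q[k] * n + p[j]];
            }
        }
    }
    return k;
}
```

Reading the factored array through `q[]` and `p[]` gives L below the diagonal (with an implied unit diagonal) and U on and above it, and their product should reproduce the original matrix read through the same permutations.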



/* ============ MANIFEST CONSTANTS ============
 *
 * Section 1: Communications Aids (Message Types and Type Selectors)
 *
 * The following manifest constants simplify the communications effort.
 * The TRANSPUTER section is fairly general in nature. The iPSC/2 section
 * specifies types and type selectors for csend() and crecv(). It IS
 * SIGNIFICANT that NODE_OFFSET is the largest of these. It must remain
 * the largest so that (for all nodes n) the value of (n + NODE_OFFSET)
 * cannot be equal to one of the other message types (consider n == 0).
 */

#ifdef TRANSPUTER

#define CUBESIZE                /* change these for a cube of other dim */
#define DIMENSION

#else  /* iPSC/2 */

#define ARG_TYPE        1   /* for passing command line argument info */
#define COL_SIZE_TYPE   2   /* for sending n part of size(A) ==> cols */
#define COL_TYPE        3   /* use this to send a column              */
#define PIVOT_TYPE      4   /* candidate for next pivot               */
#define PCOL_TYPE       5   /* use this to send a pivot column        */
#define ROW_SIZE_TYPE   6   /* for sending m part of size(A) ==> rows */
#define NODE_OFFSET     7   /* for sending messages from nodes        */




#endif

/*
 * Section 3: Timing
 *
 * The root uses a two-dimensional array where the rows are indexed by the
 * node numbers and the columns use the following indexing. The nodes, of
 * course, only need a one-dimensional array with indexing according to
 * the following scheme. There are a total of MAX_EVENTS elements in the
 * array, and indexing for a specific event is given by START_TIME, SETUP,
 * and so on. The partial pivoting case does not use all of the events.
 */

#define MAX_EVENTS      18  /* number of events that we want to time    */

#define DATA_SOURCE      0  /* node number of source of the data        */
#define START_TIME       1  /* t(0) ==> starting time for the node      */
#define SETUP            2  /* from t(0) until starting to receive cols */
#define DISTRIB_COLS     3  /* time to distribute columns               */
#define FIRST_PIVOT      4  /* from receipt of last col to start iter   */

/* The next two only apply to nodes zero and eight */
#define PCOLS_TO_HOST    5  /* time spent passing pivot cols to host    */
#define PIVOTS_TO_HOST   6  /* time spent passing pivots to host        */

/* The next five kind of represent the big picture */
#define PIVOT_ELECTION   7  /* time spent on pivot elections            */
#define UPDATING_PQ      8  /* time spent updating permutations p and q */
#define PCOL_ARITHMETIC  9  /* time spent on pivot column arithmetic    */
#define PCOL_DISTRIB    10  /* time spent distributing pivot columns    */
#define UPDATING_G      11  /* time spent updating the Gauss transform  */

/* The next four are times from within update_G() */
#define PRLTIME         12  /* pivot row location time                  */
#define LCTIME          13  /* time to determine if a column is local   */
#define G_ARITHMETIC    14  /* time spent on arithmetic within G        */
#define LOOPTIME        15  /* time for both for() loops in update_G()  */

/* The last two are back at the big picture level again */
#define ITERATION       16  /* time checked before and after iteration  */
#define STOP            17  /* the last time sampled by the node        */

/*
 * Section 4: General
 */

#define AFT    4   /* number of digits to print after decimal  */
#define WIDTH  6   /* number of characters (including decimal) */

/*
 * Section 5: A special flag used for the id field of a pivot. When it
 * appears, it indicates that the sending node's part of A has
 * no elements as big as the tolerance, tol; and therefore this node's
 * candidate for pivot should not be considered.
 */

#define RANK_DEFICIENT  -1


/* ============ TYPE DEFINITIONS ============ */

typedef struct {
    int    id;
    double u;
    int    s,
           t;
} Pivot_Type;

/* =============== EOF gf.h =============== */
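The header describes a pivot election that runs dimension-by-dimension across the hypercube and a RANK_DEFICIENT flag that excludes a node's candidate. The following is a minimal sketch of that idea, simulated on one machine rather than with real message passing. It repeats `Pivot_Type` and `RANK_DEFICIENT` from gf.h so that it compiles alone; the names `better_pivot` and `elect`, the use of `id` as the proposing node's number, and the tie-break toward the lower `id` are assumptions for illustration and are not taken from comm.c.

```c
#include <assert.h>

#define RANK_DEFICIENT -1

typedef struct {
    int    id;   /* proposing node (assumed), or RANK_DEFICIENT */
    double u;    /* candidate pivot value                       */
    int    s,    /* candidate row                               */
           t;    /* candidate column                            */
} Pivot_Type;

static double absval(double x) { return (x < 0.0) ? -x : x; }

/* Choose between two candidates.  A RANK_DEFICIENT candidate always loses;
 * otherwise the larger |u| wins, with ties going to the lower id so that
 * both partners reach the same decision. */
Pivot_Type better_pivot(Pivot_Type a, Pivot_Type b)
{
    if (a.id == RANK_DEFICIENT) return b;
    if (b.id == RANK_DEFICIENT) return a;
    if (absval(a.u) > absval(b.u)) return a;
    if (absval(b.u) > absval(a.u)) return b;
    return (a.id <= b.id) ? a : b;
}

/* Simulated election: c[] holds one candidate per node (2^dim entries).
 * In step d every node exchanges candidates with the neighbor whose number
 * differs in bit d, so after dim steps all entries hold the same winner. */
void elect(Pivot_Type *c, int dim)
{
    int d, node;
    int nodes = 1 << dim;

    assert(c != 0 && dim >= 0 && dim < 16);

    for (d = 0; d < dim; d++) {
        for (node = 0; node < nodes; node++) {
            int partner = node ^ (1 << d);
            if (node < partner) {           /* handle each pair once */
                Pivot_Type w = better_pivot(c[node], c[partner]);
                c[node]    = w;
                c[partner] = w;
            }
        }
    }
}
```

After `elect(c, dim)` every entry of `c[]` is identical. If it still carries `RANK_DEFICIENT` in `id`, no node had a usable candidate and the iteration would stop.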
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PROGRAM INFORMATION 



 *
 * SOURCE  : gfpphost.c
 * VERSION : 2.0
 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 *
 * ============= DESCRIPTION =============
 *
 * Gauss Factorization (GF) with Partial Pivoting: Parallel Version.
 * This is the manager portion of the code. See [gf.h] for details.
 */

#include <stdio.h>
#include <string.h>

#ifdef TRANSPUTER

#include <conc.h>
#include <stdlib.h>        /* addfree(), _heapend */

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <clargs.h>
#include <comm.h>
#include <epsilon.h>
#include <generate.h>
#include <io.h>
#include <ops.h>
#include <timing.h>

#else  /* iPSC/2 */

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/clargs.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/epsilon.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/io.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"


/* ============ MANIFEST CONSTANTS ============
 *
 * The following manifest constants are used to determine the size of the
 * option list, optv[]; indexing associated with valid command line
 * arguments; and selection constants for the user's choice of matrix type
 * [used in generate()].
 *
 */

#define NUMBER_OF_ARGS    3   /* -d -t -v                 */

#define DIM               0   /* index into optv[]        */
#define TIMING            1   /*   "    "     "           */
#define VERBOSE           2   /*   "    "     "           */

#define SELECT_QUIT       0   /* menu / matrix selection  */
#define SELECT_IDENTITY   1
#define SELECT_HILBERT    2
#define SELECT_RANDOM     3
#define SELECT_WILKINSON  4


/* ================= globals ================= */

static char version[] = "Parallel GF with Partial Pivoting, Version 2.0";

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#else  /* iPSC/2 */

static char *cubename;

static char *nodecode = "gfppnode";

#endif /* TRANSPUTER */

static Arg_Struct *optv[NUMBER_OF_ARGS];


/* =========== FUNCTION DEFINITION ===========
 *
 * The structure is defined more carefully in clargs.h, but the basic idea
 * is that we have an array of pointers to type Arg_Struct... in this case,
 * there are NUMBER_OF_ARGS valid arguments and the next few steps take
 * care of allocation and definition of them. The -d argument allows the
 * user to enter the desired dimension of the hypercube, -t sets timing on
 * and -v is used to set verbose on.
 */

void define_valid_args() {

    static int interpret[] = { LONG };

    install_complex_arg(DIM, optv, "-d", interpret, 1);

    install_simple_arg(TIMING,  optv, "-t");
    install_simple_arg(VERBOSE, optv, "-v");

}
/* End define_valid_args() */


/* =========== FUNCTION DEFINITION ===========
 *
 * A simple function to display the results....
 */

#ifdef PROTOTYPE

void display_timing_data(Double_Matrix_Type *A,
                         int                 dim,
                         double              a,
                         double              eps,
                         double              g,
                         double              tol,
                         int                 r,
                         double            **t)

#else

void display_timing_data(A, dim, a, eps, g, tol, r, t)

    Double_Matrix_Type *A;
    int                 dim;
    double              a,
                        eps,
                        g,
                        tol;
    int                 r;
    double            **t;

#endif
{
    int aft,
        cubesize = pow2(dim),
        i,
        m = A->rows,
        n = A->cols,
        width;

#ifdef TRANSPUTER  /* is measured in 64 microsecond ticks ==> 4-5 places */

    aft   = 5;
    width = 15;

#else  /* iPSC/2 is measured in milliseconds ==> three places */

    aft   = 3;
    width = 13;

#endif

    printf("========= TIMING DATA =========");
    printf("\n\n");

    printf("Hypercube of order %d ", dim);
    (dim == 0) ? (printf("(1 processor)\n\n")) :
                 (printf("(%d processors)\n\n", cubesize));

    printf("Problem size ==> size(A) = (%d x %d).\n", m, n);
    printf("Machine precision: eps = %e\n", eps);
    printf("Tolerance: tol = %e\n", tol);
    printf("Growth factor: g/a = %e\n", (g/a));
    printf("Rank: rank(A) =%3d\n", r);
    printf("Units for timing data: = seconds\n");

    for (i = 0; i < cubesize; i++) {

        printf("\nNode %2d Data", i);
        printf("\n\n");

        printf("Setup and initialization: ");
        printf("%*.*lf", width, aft, t[i][SETUP]);
        printf("\nInitial column distribution: ");
        printf("%*.*lf", width, aft, t[i][DISTRIB_COLS]);

        if (i == 0) {
            printf("\nTransmission of pivot columns to the host: ");
            printf("%*.*lf", width, aft, t[i][PCOLS_TO_HOST]);
            printf("\nTransmission of pivots to the host: ");
            printf("%*.*lf", width, aft, t[i][PIVOTS_TO_HOST]);
        }

        printf("\nPerformance of pivot column arithmetic: ");
        printf("%*.*lf", width, aft, t[i][PCOL_ARITHMETIC]);
        printf("\nDistribution of pivot columns: ");
        printf("%*.*lf", width, aft, t[i][PCOL_DISTRIB]);
        printf("\nPerformance of updates and arithmetic in G: ");
        printf("%*.*lf", width, aft, t[i][UPDATING_G]);
        printf("\nUpdate_G(): loop time including arithmetic: ");
        printf("%*.*lf", width, aft, t[i][LOOPTIME]);

        printf("\n\nTime for all work inside main iteration loop: ");
        printf("%*.*lf", width, aft, t[i][ITERATION]);
        printf("\nTotal time from start to stop: ");
        printf("%*.*lf\n\n", width, aft, (t[i][STOP] - t[i][START_TIME]));
    }
}
/* End display_timing_data() */



/* =========== FUNCTION DEFINITION ===========
 *
 * This function distributes the columns of A to the nodes of the hyper-
 * cube. The loop variable, j, designates each column of A in turn. The
 * column buffer, cbuf[], copies from A the column to be transmitted.
 * After cbuf[] is filled, [i = (j mod cubesize)] means that node i will
 * get column j and the modulus operation seems to be a reasonable and
 * efficient scheme of distribution. Finally, the call to send() ships
 * the column out to the appropriate node.
 */

#ifdef PROTOTYPE

void distribute_columns(Double_Matrix_Type *A, int dim, double *cbuf)

#else

void distribute_columns(A, dim, cbuf)

    Double_Matrix_Type *A;
    int                 dim;
    double             *cbuf;

#endif
{
    int i,
        j,
        pos = 42,                  /* position of print head       */
        rm  = LINE_LENGTH - 10;    /* right margin (see matrix.h)  */

    long cubesize   = pow2(dim),
         sizeof_col = (long) (A->rows * sizeof(double));

    printf("Distributing the columns of A to the nodes");

    for (j = 0; j < A->cols; j++) {

        for (i = 0; i < A->rows; i++) { cbuf[i] = A->matrix[i][j]; }

        i = j % cubesize;                      /* column --> node i */

#ifdef TRANSPUTER                              /* node 0 has to sort 'em out */

        if (i < 8) {
            send(0, (char *) cbuf, sizeof_col, cubesize);
        }
        else {
            send(8, (char *) cbuf, sizeof_col, cubesize);
        }

#else  /* iPSC/2 */

        send(i, (char *) cbuf, sizeof_col, COL_TYPE);

#endif /* TRANSPUTER */

        printf(".");
        if (pos++ > rm) {

            pos = 0;
            printf("\n");
        }

    }

    printf("\nColumn distribution complete\n\n");

}
/* End distribute_columns() */
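The cyclic (modulus) distribution used by distribute_columns() determines which node owns each column and how many columns each node ends up with. The helpers below are a small illustrative sketch of that bookkeeping. The names `owner_of_column`, `local_index`, and `columns_on_node` are hypothetical, and the assumption that a node stores its columns in arrival order (so that column j sits at local position j / cubesize) is mine, not something stated in the listing.

```c
#include <assert.h>

/* Node that receives global column j when columns are dealt out cyclically. */
int owner_of_column(int j, int cubesize)
{
    assert(j >= 0 && cubesize > 0);
    return j % cubesize;
}

/* Position of global column j within its owner's local storage, assuming
 * the owner keeps its columns in the order they arrive. */
int local_index(int j, int cubesize)
{
    assert(j >= 0 && cubesize > 0);
    return j / cubesize;
}

/* Number of the n columns that land on the given node.  Nodes with lower
 * numbers receive one extra column when n is not a multiple of cubesize. */
int columns_on_node(int n, int node, int cubesize)
{
    assert(n >= 0 && cubesize > 0 && node >= 0 && node < cubesize);
    return (n - node + cubesize - 1) / cubesize;
}
```

For example, with 10 columns on 4 nodes, nodes 0 and 1 each hold three columns and nodes 2 and 3 each hold two, so the load differs by at most one column per node.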


/* ============ FUNCTION DEFINITION ============
 *
 * This function prompts the user for matrix size and type, then generates
 * the matrix with a call to a function from generate.c.
 */

#ifdef PROTOTYPE

Double_Matrix_Type *generate(int *m, int *n)

#else

Double_Matrix_Type *generate(m, n)

    int *m,
        *n;
#endif
{
    Double_Matrix_Type *A;

    int matrix_type,
        valid = FALSE;

    printf("Please enter the number of rows in A: ");
    scanf("%d", m);
    fflush(stdin);

    printf("\n and the number of columns in A: ");
    scanf("%d", n);
    fflush(stdin);

    printf("\n\nSelect from the following list of matrices: ");

    while (!valid) {

        printf("\n\n");

        printf("    %d.) QUIT      \n", SELECT_QUIT     );
        printf("    %d.) Identity  \n", SELECT_IDENTITY );
        printf("    %d.) Hilbert   \n", SELECT_HILBERT  );
        printf("    %d.) Random    \n", SELECT_RANDOM   );
        printf("    %d.) Wilkinson \n", SELECT_WILKINSON);
        printf("\n> ");
        scanf("%d", &matrix_type);
        fflush(stdin);

        switch (matrix_type) {

            case SELECT_IDENTITY  :
            case SELECT_HILBERT   :
            case SELECT_RANDOM    :
            case SELECT_WILKINSON : valid = TRUE; break;

            case SELECT_QUIT      : exit(EXIT_SUCCESS);
        }

    }   /* end while() */

    switch (matrix_type) {

        case SELECT_IDENTITY:

            printf("\n\nGenerating A = identity(%d, %d).\n\n", *m, *n);

            A = identity(*m, *n);
            break;

        case SELECT_HILBERT:

            printf("\n\nGenerating A = hilbert(%d, %d).\n\n", *m, *n);

            A = hilbert(*m, *n);
            break;

        case SELECT_RANDOM:

            printf("\n\nGenerating A = mxrand(%d, %d).\n\n", *m, *n);

            A = mxrand(*m, *n);
            break;

        case SELECT_WILKINSON:

            printf("\n\nGenerating A = wilkinson(%d, %d).\n\n", *m, *n);

            A = wilkinson(*m, *n);
            break;
    }

    if (!A) {

        printf("generate(): Allocation failure for the matrix A.\n");
        exit(EXIT_FAILURE);
    }

    return(A);

}
/* End generate() */


/* ============ FUNCTION DEFINITION ============
 *
 * Collect timing data from the nodes. The Intel side of this function
 * takes advantage of the host's ability to receive from any node. The
 * transputer side must receive every node's information from nodes zero &
 * eight (eight only becomes involved in the case of the hybrid 4-cube).
 */

#ifdef PROTOTYPE

double **receive_timing_data(int cubesize)

#else

double **receive_timing_data(cubesize)

    int cubesize;

#endif
{
    double **dt;      /* (double) version of t[][]   */

    int      i,
             j;

    long     tlen;    /* length of one node's data   */

    ticks  **t;       /* raw timing data from nodes  */

    /*
     * Perform allocation for the timing data, t[][]. The two-dimensional
     * array is indexed by node number for the rows and by event for the
     * columns. For instance, t[i][j] means the time required for event
     * j at node i. Actually, there is an extra row reserved at the end
     * of t[][] for totals: t[cubesize][j] gives the total time for event
     * j across all nodes.
     */
    if (!(dt = (double **) malloc((cubesize+1) * sizeof(double*)))) {
        printf("receive_timing_data(): Allocation failure for dt[][].\n");
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < (cubesize + 1); i++) {
        if (!(dt[i] = (double *) calloc(MAX_EVENTS, sizeof(double)))) {
            printf("Host: Allocation failure for dt[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }
    if (!(t = (ticks **) malloc((cubesize+1) * sizeof(ticks*)))) {
        printf("receive_timing_data(): Allocation failure for t[][].\n");
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < (cubesize + 1); i++) {
        if (!(t[i] = (ticks *) calloc(MAX_EVENTS, sizeof(ticks)))) {
            printf("Host: Allocation failure for t[%d].\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Receiving timing data from the nodes");
    tlen = (long) (MAX_EVENTS * sizeof(ticks));
    for (i = 0; i < cubesize; i++) {
        printf(".");

#ifdef TRANSPUTER

        if (i < 8) receive(0, (char *) t[i], tlen, cubesize);
        else       receive(8, (char *) t[i], tlen, cubesize);

#else  /* iPSC/2 */

        receive(i, (char *) t[i], tlen, (i + NODE_OFFSET));

#endif /* TRANSPUTER */

    }
    printf("\n\n");

    /* Calculate totals, averages; place totals in t[cubesize] first,
     * then copy to dt[][] and record averages in dt[cubesize].
     */
    for (i = 0; i < cubesize; i++) {
        for (j = 0; j < MAX_EVENTS; j++) t[cubesize][j] += t[i][j];
    }

    /* Fill dt[][] with double values (in seconds). The conversion
     * factors are borrowed from timing.h.
     */
    for (i = 0; i <= cubesize; i++) {
        dt[i][DATA_SOURCE] = (double) t[i][DATA_SOURCE];
        for (j = START_TIME; j < MAX_EVENTS; j++) {

#ifdef TRANSPUTER
            dt[i][j] = ((double) t[i][j]) * LO_PERIOD;
#else
            dt[i][j] = ((double) t[i][j]) * M_PERIOD;
#endif
        }
    }

    /* Convert totals to averages in dt[cubesize] */
    for (j = START_TIME; j < MAX_EVENTS; j++) {

        dt[cubesize][j] /= ((double) cubesize);
    }

    for (i = 0; i < (cubesize + 1); i++) free(t[i]);
    free(t);

    return(dt);
}
/* End receive_timing_data() */
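The bookkeeping in receive_timing_data() (one row of raw ticks per node, an extra last row for totals, then conversion to seconds and averaging) can be isolated in a short sketch. The version below is illustrative only: `N_EVENTS` stands in for MAX_EVENTS, `summarize` and `seconds_per_tick` are hypothetical names, plain `long` replaces the `ticks` type, and the special handling of the DATA_SOURCE column is left out.

```c
#include <assert.h>

#define N_EVENTS 3   /* hypothetical stand-in for MAX_EVENTS */

/* t and dt each have (nodes + 1) rows.  Row `nodes` of t receives the
 * per-event totals; dt receives every row converted to seconds, and its
 * last row is then divided by the node count to give per-event averages. */
void summarize(long t[][N_EVENTS], double dt[][N_EVENTS],
               int nodes, double seconds_per_tick)
{
    int i, j;

    assert(nodes > 0);

    /* Totals over all nodes go into the extra row. */
    for (j = 0; j < N_EVENTS; j++) t[nodes][j] = 0;
    for (i = 0; i < nodes; i++) {
        for (j = 0; j < N_EVENTS; j++) t[nodes][j] += t[i][j];
    }

    /* Convert raw ticks (including the totals row) to seconds. */
    for (i = 0; i <= nodes; i++) {
        for (j = 0; j < N_EVENTS; j++) {
            dt[i][j] = ((double) t[i][j]) * seconds_per_tick;
        }
    }

    /* Turn the totals row into averages. */
    for (j = 0; j < N_EVENTS; j++) dt[nodes][j] /= (double) nodes;
}
```

Keeping the totals in integer ticks and converting once at the end, as the listing does, avoids accumulating floating-point rounding across nodes.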

562 

563 

564 

565 

566 

567 /♦ ============ FUNCTION DEFINITION ============ 

566 * 

569 * This function analyzes the command line that the user supplied and sets 

570 * variables accordingly. The valid arguments axe given by def ine_valid_ 

571 * axgs(), and the real work is passed off to interpret_axgs ( ) , from the 

572 * claxgs library. 

573 ♦ / 

574 

575 #if def PROTOTYPE 



576 

577 
576 
579 

560 

561 

562 

563 

564 

565 

566 

567 
566 
569 

590 

591 

592 

593 

594 

595 

596 

597 
596 



void resolve_args (int axgc, char ♦axgv[], 

int ♦dim, int ♦timing, int ♦verbose) 

#else 

void resolve_args (axgc , axgv, dim, timing, verbose) 

int axgc; 
char *argv [] ; 
int *dim, 

♦timing , 

♦verbose ; 

#endif 

{ 

int maxdim = 3, 

valid = FALSE; 



interpret.axgs (axgc , axgv, NUMBER_0F_ARGS , optv) ; /♦ see claxgs. h ♦/ 

#if def TRANSPUTER 



599 

600 ♦dim = DIMENSION; 
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601 

602 

603 

604 

605 

606 
607 
606 

609 

610 
611 
612 

613 

614 

615 

616 

617 

618 

619 

620 
621 
622 

623 

624 

625 

626 
627 
626 

629 

630 

631 

632 

633 

634 

635 

636 

637 

638 

639 

640 

641 

642 

643 

644 

645 

646 

647 

648 

649 

650 



#else /* iPSC/2 */

    if (optv[DIM]->found) *dim = (int) optv[DIM]->lsa[0];
    switch (*dim) {

        case 0: case 1: case 2: case 3: break;

        default: while (!valid) {

            printf("Enter desired cube dimension (0...%d): ", maxdim);
            scanf("%d", dim);
            fflush(stdin);

            switch (*dim) {

                case 0: case 1: case 2: case 3:

                    valid = TRUE;
                    break;
            }
        }

    } /* end switch() */

#endif /* TRANSPUTER */

    (optv[TIMING]->found)  ? (*timing = TRUE)  : (*timing = FALSE);

    (optv[VERBOSE]->found) ? (*verbose = TRUE) : (*verbose = FALSE);

    printf("Argument resolution complete ... \n\n");
    printf("     Cube Dimension:  %d\n", *dim);

    if (*timing) printf("             Timing:  ON\n");

    (*verbose) ? (printf("       Verbose Mode:  ON\n\n")) : (printf("\n"));

}
/* End resolve_args() */


/* ============ FUNCTION DEFINITION ============
 *
 */

#ifdef PROTOTYPE

void show_resulting_matrices(Double_Matrix_Type *A,
                             Double_Matrix_Type *A0, int *q)

#else

void show_resulting_matrices(A, A0, q)

    Double_Matrix_Type *A,
                       *A0;
    int                *q;

#endif
{
    Double_Matrix_Type *D,
                       *L,
                       *LU,
                       *P,
                       *QT,
                       *QTA,
                       *QTAP,
                       *U;

    int i,
        j,
        m = A->rows,
        n = A->cols;

    printf("Gauss Factorization Complete ... \n\n");
    strcpy(A->name, "A (after GF operations)");

    /* Allocate and form Q' and P */

    if (!(QT = matalloc(m,m))) {

        printf("Allocation failure for QT.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(QT->name, "Q Transpose");

    for (i = 0; i < m; i++) { QT->matrix[i][q[i]] = 1.0; }

    if (!(P = identity(n,n))) {

        printf("Allocation failure for P.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(P->name, "P [ Partial (column) Pivoting ==> P == Identity ]");

    /* Here, we slowly form Q'AP, keeping in mind that the A we are
     * talking about is the original A.... and we have labeled that one
     * A0.  Therefore, we first form QTA (Q'A) as Q' * A0.  After we
     * have QTA, we can multiply it (on the right) by P to get Q'AP,
     * or QTAP as it is called here.
     */
    if (!(QTA = matalloc(m,n))) {

        printf("Allocation failure for QTA.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(QTA->name, "Q' * (original) A");

    if (matrix_product(QT, A0, QTA) == FAILURE) {

        printf("matrix_product(QTA) Failure.\n");
        exit(EXIT_FAILURE);
    }

    if (!(QTAP = matalloc(m,n))) {

        printf("Allocation failure for QTAP.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(QTAP->name, "Q' * A * P");

    if (matrix_product(QTA, P, QTAP) == FAILURE) {

        printf("matrix_product(QTAP) Failure.\n");
        exit(EXIT_FAILURE);
    }

    /* Next, we form L and U so that we can compare Q'AP ?=? LU. */

    L = zeros(m, n);  L->name = "L";

    U = zeros(m, n);  U->name = "U";

    for (i = 0; i < A->rows; i++) {

        for (j = 0; j < A->cols; j++) {

            if (i < j) { U->matrix[i][j] = A->matrix[i][j]; }

            if (i == j) {

                L->matrix[i][j] = 1.0;

                U->matrix[i][j] = A->matrix[i][j];
            }

            if (i > j) { L->matrix[i][j] = A->matrix[i][j]; }
        }
    }

    if (!(LU = matalloc(m,n))) {

        printf("Allocation failure for LU.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(LU->name, "L * U");

    if (matrix_product(L, U, LU) == FAILURE) {

        printf("matrix_product(LU) Failure.\n");
        exit(EXIT_FAILURE);
    }

    /* Finally, we create a matrix of differences between the elements
     * found in QTAP (Q'AP) and LU.  If everything proceeded according
     * to the plan, this will be a matrix of zeros.
     */
    if (!(D = matalloc(m,n))) {

        printf("Allocation failure for D.\n");
        exit(EXIT_FAILURE);
    }
    strcpy(D->name, "Q'AP - LU");
    for (i = 0; i < m; i++) {

        for (j = 0; j < n; j++) {

            D->matrix[i][j] = (QTAP->matrix[i][j] - LU->matrix[i][j]);
        }
    }

    printmd(*A, WIDTH, AFT);
    printf("\n\n");
    printmd(*L, WIDTH, AFT);
    printf("\n\n");
    printmd(*U, WIDTH, AFT);

    printf("\n\n");
    printmd(*QT, WIDTH, AFT);
    printf("\n\n");
    printmd(*P, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTA, WIDTH, AFT);
    printf("\n\n");
    printmd(*QTAP, WIDTH, AFT);
    printf("\n\n");
    printmd(*LU, WIDTH, AFT);
    printf("\n\n");
    printmd(*D, WIDTH, AFT);
    printf("\n\n");

}
/* End show_resulting_matrices() */
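
The L/U split used in show_resulting_matrices() can be tried in isolation. The following is only a sketch under stated assumptions: it works on plain row-major arrays rather than Double_Matrix_Type, and the names split_lu and lu_entry are hypothetical helpers that do not appear in gfpphost.c.

```c
/* Split a packed n-by-n factorization (row-major array a[]) into a unit
 * lower triangular l[] and an upper triangular u[], mirroring the three
 * cases (i < j, i == j, i > j) handled in show_resulting_matrices(). */
void split_lu(const double *a, int n, double *l, double *u)
{
    int i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            l[i * n + j] = (i > j) ? a[i * n + j] : ((i == j) ? 1.0 : 0.0);
            u[i * n + j] = (i <= j) ? a[i * n + j] : 0.0;
        }
    }
}

/* Entry (i,j) of the product l * u, for comparison against Q'AP. */
double lu_entry(const double *l, const double *u, int n, int i, int j)
{
    int k;
    double sum = 0.0;

    for (k = 0; k < n; k++) sum += l[i * n + k] * u[k * n + j];
    return sum;
}
```

Comparing lu_entry() against the corresponding entry of Q'AP, element by element, is the same check that the difference matrix D performs in the listing.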




/* ============ FUNCTION DEFINITION ============
 *
 * This is a simple function to physically swap the elements from row s to
 * the current pivot row, r.  It does not concern itself with column r or
 * any column j > r.
 */

#ifdef PROTOTYPE

void swap_rows_left_of_pivot(Double_Matrix_Type *A, int r, int s)

#else

void swap_rows_left_of_pivot(A, r, s)

    Double_Matrix_Type *A;
    int                 r,
                        s;

#endif
{
    double tmp;

    int j;

    for (j = 0; j < r; j++) {

        tmp = A->matrix[r][j];
        A->matrix[r][j] = A->matrix[s][j];
        A->matrix[s][j] = tmp;
    }

}
/* End swap_rows_left_of_pivot() */




/* =========== FUNCTION DEFINITION ===========
 *
 * This function performs updates to a permutation vector, v[], of length
 * 'size'.  The pivot_index indicates the row or column where the next
 * pivot has been located; and k indicates the stage, or the row and
 * column where the pivot is to be installed.
 */

#ifdef PROTOTYPE

void update_permutation(int v[], int size, int k, int pivot_index)

#else

void update_permutation(v, size, k, pivot_index)

    int v[],
        size,
        k,
        pivot_index;

#endif
{
    int i;

    i = v[k];  v[k] = v[pivot_index];  v[pivot_index] = i;
}
/* End update_permutation() */
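
To see how the permutation vector evolves over several stages, the same swap can be replayed on an identity vector. This is a minimal sketch, not part of gfpphost.c: the helper names swap_entries and apply_pivot_swaps, and the idea of passing the pivot rows as an array, are assumptions made for illustration.

```c
/* Same exchange as update_permutation(): swap v[k] and v[pivot_index]. */
static void swap_entries(int v[], int k, int pivot_index)
{
    int tmp = v[k];

    v[k] = v[pivot_index];
    v[pivot_index] = tmp;
}

/* Hypothetical helper: start from the identity permutation of length
 * size and replay one swap per stage, where pivots[k] is the row in
 * which the stage-k pivot was found. */
void apply_pivot_swaps(int q[], int size, const int pivots[], int stages)
{
    int k;

    for (k = 0; k < size; k++) q[k] = k;
    for (k = 0; k < stages; k++) swap_entries(q, k, pivots[k]);
}
```

With the resulting q, row i of Q' carries its single 1.0 in column q[i], which is how show_resulting_matrices() fills QT.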





#ifdef PROTOTYPE /* ================================================= */

main(int argc, char *argv[])

#else

main(argc, argv)

    int   argc;
    char *argv[];

#endif
{


    /* ========== VARIABLE DEFINITIONS ========== */

    double  a,                /* denominator of growth factor (g/a)   */
           *cbuf,             /* col buffer holds one col at a time   */
          **dtime,            /* doubles corresponding to ticks **t   */
            eps = epsd(),     /* machine precision (see machine.h)    */
            g = 0.0,          /* the growth factor                    */
            root_time,        /* time measured at root for iterations */
            tol;              /* tolerance                            */

    Double_Matrix_Type *A,    /* This A gets operated upon/changed    */
                       *A0;   /* The original copy of A               */

    int     cubesize,         /* number of processors in the cube     */
            dim,              /* dimension of the hypercube           */
            i,
            j,
            m,                /* number of rows in A                  */
            me,               /* root processor's id                  */
            n,                /* number of cols in A                  */
           *q,                /* row permutation vector               */
            r,                /* numerical rank estimate              */
            timing,           /* Boolean                              */
            verbose;          /* Boolean                              */

    long    sizeof_col,       /* sizes, in bytes                      */
            sizeof_int,
            sizeof_pivot;

    ticks   root_start,
            t_root,           /* time measured at root transputer     */
          **t;                /* time data: row => node, col => event */

    Pivot_Type pivot;         /* pivot                                */

    /* ========== INITIALIZATIONS ========== */

#ifdef TRANSPUTER

    /* Add 1M to the heap to allow for generation of large matrices */
    addfree((void *) _heapend, 0x100000);

#endif

    printf("\n%s\n\n", version);

    define_valid_args();

    resolve_args(argc, argv, &dim, &timing, &verbose);

    A = generate(&m, &n);

    sizeof_col   = (long) (A->rows * sizeof(double));
    sizeof_int   = (long) sizeof(int);
    sizeof_pivot = (long) sizeof(Pivot_Type);

    if (!(cbuf = (double *) malloc(sizeof_col))) {

        printf("main(): Allocation failure for cbuf[].\n");
        exit(EXIT_FAILURE);
    }

    cubesize = POW2(dim);


#ifdef TRANSPUTER

    initialize_hypercube(dim);

#else

    cubename = initialize_hypercube(dim, nodecode);
#endif

    me = myhost();
    if (verbose) {

        if (!(A0 = matalloc(m,n))) {

            printf("Allocation failure for A0.\n");
            exit(EXIT_FAILURE);
        }

        strcpy(A0->name, "Original A");

        for (i = 0; i < A->rows; i++) {

            for (j = 0; j < A->cols; j++) {

                A0->matrix[i][j] = A->matrix[i][j];
            }
        }

        printf("\n\nA has been allocated and generated.\n\n");
        printmd(*A, WIDTH, AFT);

        printf("\n\nSending size(A) to the nodes.\n\n");
    }

#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m,      sizeof_int, cubesize);

    cubecast(me, dim, (char *) &n,      sizeof_int, cubesize);

    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else /* iPSC/2 */

    cubecast(me, dim, (char *) &m,      sizeof_int, ROW_SIZE_TYPE);

    cubecast(me, dim, (char *) &n,      sizeof_int, COL_SIZE_TYPE);

    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif

    if (verbose) printf("\nSent size(A) to nodes.\n");
    distribute_columns(A, dim, cbuf);
    q = initial_permutation_vector(m);



    /* FINAL PREPARATIONS BEFORE STARTING THE ITERATION
     *
     * Get the first pivot from node 0.  Initialize the growth factor
     * variables, g and a, so that we can compute growth factor (g/a) as
     * we go.  Set a reasonable tolerance.
     *
     */

#ifdef TRANSPUTER

    receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

    receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */

    a = g = MAX(g, fabs(pivot.u));

    tol = (MIN(m,n)) * g * eps;


    /* BEGINNING OF ITERATION
     *
     * We enter with A established and knowledge of the first pivot.
     *
     */

#ifdef TRANSPUTER

    root_start = clock();

#endif

    printf("Beginning iterations.\n\n");

    for (r = 0; r < (MIN(m,n)); r++) {

        if (pivot.id == RANK_DEFICIENT) break;

        /* We expect to receive cbuf[] in the correct (i.e., already
         * swapped) order.  Before we stuff cbuf[] into A[][], we'll swap
         * rows left of the pivot column, and then insert the new pivot
         * column.
         */

#ifdef TRANSPUTER

        receive(0, (char *) cbuf, sizeof_col, cubesize);

#else /* iPSC/2 */

        receive(0, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif /* TRANSPUTER */

        g = MAX(g, fabs(pivot.u));

        update_permutation(q, m, r, pivot.s);

        if (pivot.s != r) swap_rows_left_of_pivot(A, r, pivot.s);

        for (i = 0; i < A->rows; i++) { A->matrix[i][r] = cbuf[i]; }

        if (verbose) {

            printf("Host : Stage %d, Pivot value = %e.  ", r, pivot.u);
            printf("Growth factor = %e.\n", (g/a));
            printf("q = ");  printvi(q, A->rows, WIDTH);
            printf("\n");

        }

        if (r < ((MIN(m,n)) - 1)) {

#ifdef TRANSPUTER

            receive(0, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

            receive(0, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */

        }

    } /* end for(r) */

#ifdef TRANSPUTER

    t_root = (clock() - root_start);

    if (timing) {

        root_time = ((double) t_root) * LO_PERIOD;

        printf("\n\nRoot transputer: ");
        printf("Time for iterations: %8.4lf seconds\n\n", root_time);
    }

#endif


    free(cbuf);


    /* I have selected the easy way out and assumed A has full rank.  If
     * you did not make this assumption, you would need to collect the
     * remaining columns at this point.
     */

    if (timing) dtime = receive_timing_data(cubesize);


    /* There is no more use for the nodes, so they can be released. */

#ifndef TRANSPUTER
    printf("\n\nmain() : Killing and releasing cube.\n\n");
    killcube(ALL_NODES, ALL_PIDS);
    relcube(cubename);
#endif

    if (verbose) { /* Create and show Q', A0, P, L, U .... */

        show_resulting_matrices(A, A0, q);

    }


    if (timing) display_timing_data(A, dim, a, eps, g, tol, r, dtime);

}
/* ============= EOF gfpphost.c ============ */
/* ========== PROGRAM INFORMATION ==========
 *
 * SOURCE  : gfppnode.c
 * VERSION : 2.0
 * DATE    : 21 September 1991
 * AUTHOR  : Jonathan E. Hartman, U. S. Naval Postgraduate School
 * REMARKS : See gf.h.
 *
 */


#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"
#endif

#include "gf.h"


#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif


ticks t[MAX_EVENTS];




/* =========== FUNCTION DEFINITION ===========
 *
 * This function is kind of an inverse for local_column().  Given some
 * column number (local_column) held at this node, the function returns
 * the corresponding column number in the global/host copy of the full-
 * sized A.  This could be implemented more efficiently as a macro.
 */

#ifdef PROTOTYPE

int global_column(int local_column, int me, int cubesize)

#else

int global_column(local_column, me, cubesize)

    int local_column,
        me,
        cubesize;

#endif
{
    return(local_column * cubesize + me);
}
/* End global_column() */





/* =========== FUNCTION DEFINITION ===========
 *
 * This function maps a column number in the global A (the full-sized A
 * held at the root processor/host) to the corresponding local column
 * number.  If the global_column is not one that is held at this node, a
 * negative value (-1) is returned.
 */

#ifdef PROTOTYPE

int local_column(int global_column, int me, int cubesize)

#else

int local_column(global_column, me, cubesize)

    int global_column,
        me,
        cubesize;
#endif
{
    if ((global_column % cubesize) != me) return(-1);

    return((int) global_column / cubesize);
}
/* End local_column() */
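
The two mapping functions above describe a wrap (cyclic) distribution of columns over the nodes. The sketch below restates that mapping as standalone functions so the round trip can be checked; the names to_global and to_local are hypothetical and are used here only to avoid clashing with the listing's own identifiers.

```c
/* Wrap (cyclic) column mapping: global column gc lives on node gc % p,
 * where p is the number of nodes, at local index gc / p. */
int to_global(int lc, int me, int p)
{
    return lc * p + me;
}

/* Returns -1 when global column gc is not held by node me. */
int to_local(int gc, int me, int p)
{
    if ((gc % p) != me) return -1;
    return gc / p;
}
```

For example, with 8 nodes, local column 2 on node 3 is global column 19, and only node 3 maps column 19 back to a non-negative local index.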





/* =========== FUNCTION DEFINITION ===========
 *
 */

#ifdef PROTOTYPE

void do_pivot_column_arithmetic(Double_Matrix_Type *A, double *cbuf,
                                int k, int me, int cubesize)

#else

void do_pivot_column_arithmetic(A, cbuf, k, me, cubesize)

    Double_Matrix_Type *A;
    double             *cbuf;
    int                 k,
                        me,
                        cubesize;

#endif
{
    double pivot_value;

    int i,
        pivot_column;


    pivot_column = local_column(k, me, cubesize);

    pivot_value = A->matrix[k][pivot_column];


    /* Divide everything under the pivot by the pivot value */
    for (i = (k+1); i < A->rows; i++) {

        A->matrix[i][pivot_column] /= pivot_value;
    }


    /* This is somewhat redundant, and not optimal with respect to
     * efficiency, but it works and reads clearly, right?
     */

    for (i = 0; i < A->rows; i++) cbuf[i] = A->matrix[i][pivot_column];

}
/* End do_pivot_column_arithmetic() */





/* =========== FUNCTION DEFINITION ===========
 *
 * This function accepts the matrix, the global column number for this
 * stage (where the pivot will be taken from), and a pivot structure to be
 * filled.... among other things.... and 'returns' the row, s, and value, u,
 * of the new pivot in global column r (local column lc).
 */

#ifdef PROTOTYPE

void locate_pivot(int me, int cubesize, Double_Matrix_Type *A, int r,
                  Pivot_Type *pivot)

#else

void locate_pivot(me, cubesize, A, r, pivot)

    int                 me,
                        cubesize;
    Double_Matrix_Type *A;
    int                 r;
    Pivot_Type         *pivot;

#endif
{
    int i,
        pivot_column;


    pivot_column = local_column(r, me, cubesize);

    /* Initialize pivot row and value */
    pivot->s = r;

    pivot->u = A->matrix[r][pivot_column];

    for (i = (r+1); i < A->rows; i++) {

        if (fabs(A->matrix[i][pivot_column]) > fabs(pivot->u)) {
            pivot->s = i;

            pivot->u = A->matrix[i][pivot_column];
        }
    }
}
/* End locate_pivot() */
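
The search performed by locate_pivot() is the usual partial-pivoting scan down one column. As a sketch only, here it is on a plain array of doubles; the function name find_pivot_row and the local absolute-value helper are assumptions made so the fragment stands alone without math.h.

```c
/* Local absolute value so the sketch needs no library support. */
static double abs_value(double x)
{
    return (x < 0.0) ? -x : x;
}

/* Return the index s >= r of the entry of largest magnitude in
 * col[r..rows-1].  Ties keep the earliest row, as in locate_pivot(),
 * because the comparison is strict. */
int find_pivot_row(const double *col, int rows, int r)
{
    int i, s = r;

    for (i = r + 1; i < rows; i++) {
        if (abs_value(col[i]) > abs_value(col[s])) s = i;
    }
    return s;
}
```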



/♦ ' 

* 

♦ 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

* 

*/ 



FUNCTION DEFINITION 



Receive this node's columns from the root/host processor (manager), 
place them into the column buffer, then transfer them into A while 
the other processors are communicating with the root. 

The transputer scheme is a bit more involved. Here nodes 0000 and 1000 
axe connected to the root and they must receive for everyone. They (0 
and 8) are not directly connected to everyone, so the columns must be 
passed out in cycles. For instance, suppose we used the hybrid 4-cube. 
Then nodes 0 and 8 would receive bursts of 8 columns at a time. They 
would keep the first one (we'll call it column 0 in some sort of rela- 
tive numbering scheme that abides by the C numbering convention), send 
the next one (col 1) in the 0x1 direction, the next to the 0x2 direc- 
tion, column 3 in the 0x1 direction, column 4 in the 0x4 direction, 
column 5 in the 0x1 direction, column 6 in the 0x2 direction, and 
lastly, column 7 in the 0x1 direction. This makes cycle == 8 for nodes 
0000 and 1000. Similarly, nodes xOOl have a cycle of four where they 
keep the first column to arrive and then send the next three to direc- 
tions 0x2, 0x4, and 0x2 in turn. This distribution pattern is main- 
tained until all of the columns have been distributed. 



#ifdef PROTOTYPE

void receive_columns(int                 dim,
                     int                 node,
                     Double_Matrix_Type *A,
                     int                 n,
                     double             *cbuf,
                     int                 my_cols,
                     int                 colsize)

#else

void receive_columns(dim, node, A, n, cbuf, my_cols, colsize)

    int                 dim,
                        node;
    Double_Matrix_Type *A;
    int                 n;
    double             *cbuf;
    int                 my_cols,
                        colsize;

#endif
{
    int cubesize = pow2(dim),
        cycle,                    /* length of typical col burst    */
        dimeff   = MIN(3, dim),   /* effective dimension            */
        from,                     /* node that I receive from       */
        gc,                       /* global column index            */
        i,
        idx,                      /* index into to[]                */
        lc       = 0,             /* local column index             */
        ldeff,                    /* effective least_dimension()    */
        nodeff   = (node % 8),    /* effective node number          */
        others,                   /* no. of nodes in other 3-cube   */
        step,                     /* for destination of cols rec'd  */
        thehost  = myhost(),
        to[8];                    /* ==> direction to send to       */


#ifdef TRANSPUTER


    ldeff = least_dimension(nodeff);

    if (nodeff == 0) from = myhost();

    else             from = node ^ pow2(ldeff - 1);

    /* cycle describes the length of a cycle that starts with me (node)..
     * then I receive several columns for others.... then start over with
     * me.  The nodes in the highest dimension have cycle == 1 ==> self
     * only.  We also fill to[] with the directions that we will be
     * sending to within a given cycle.  Not all nodes use all 8 elements
     * of to[].  They only use the first cycle elements.  The step is the
     * difference between the column numbers received at this node during
     * a given burst of length cycle.
     *
     * When we use the hybrid 4-cube, we are treating it as two 3-cubes,
     * so the variable others is set to 8.  This is because there are 8
     * other columns between every burst that comes to the 3-cube that
     * node is in.
     */
    cycle = pow2(dimeff - ldeff);

    (dim == 4) ? (others = 8) : (others = 0);

    step = pow2(ldeff);

    to[0] = 0;
    to[1] = to[3] = to[5] = to[7] = pow2(ldeff);
    to[2] = to[6] = pow2(ldeff + 1);
    to[4] = pow2(ldeff + 2);


    for (gc = node; gc < n; gc += (others + step)) {

        receive(from, (char *) cbuf, colsize, cubesize);

        for (i = 0; i < A->rows; i++) A->matrix[i][lc] = cbuf[i];

        lc++;

        for (idx = 1; idx < cycle; idx++) {

            gc += step;

            if (gc < n) {

                receive(from, (char *) cbuf, colsize, cubesize);

                directional_send(node, dim, to[idx], (char *) cbuf, colsize);
            }
        }

    } /* end for(gc) */


#else /* iPSC/2 */

    for (lc = 0; lc < my_cols; lc++) {

        receive(thehost, (char *) cbuf, colsize, COL_TYPE);

        for (i = 0; i < A->rows; i++) { A->matrix[i][lc] = cbuf[i]; }
    }

#endif /* TRANSPUTER */

}
/* End receive_columns() */
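
The direction table to[] filled in receive_columns() follows a regular pattern: entry idx is the lowest set bit of idx, shifted left by ldeff. The sketch below is an observation about that table rather than code from gfppnode.c; the helper names lowest_set_bit and fill_directions are hypothetical.

```c
/* Lowest set bit of idx (0 when idx == 0).  For idx = 0..7 this gives
 * the sequence 0, 1, 2, 1, 4, 1, 2, 1. */
static int lowest_set_bit(int idx)
{
    return idx & -idx;
}

/* Hypothetical helper: fill to[0..7] with the same values that
 * receive_columns() assigns, given ldeff (the effective least
 * dimension of the forwarding node). */
void fill_directions(int ldeff, int to[8])
{
    int idx;

    for (idx = 0; idx < 8; idx++) {
        to[idx] = lowest_set_bit(idx) << ldeff;
    }
}
```

With ldeff == 0 (nodes 0000 and 1000) the table reads 0, 0x1, 0x2, 0x1, 0x4, 0x1, 0x2, 0x1, matching the burst of eight described in the function's header comment. With ldeff == 1 the first four entries are 0, 0x2, 0x4, 0x2, matching the cycle of four for nodes x001.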




/* =========== FUNCTION DEFINITION ===========
 *
 * This function sends in the timing data that is held in t[].
 */

#ifdef PROTOTYPE

void submit_timing_data(int node, int dim)

#else

void submit_timing_data(node, dim)

    int node,
        dim;

#endif
{
    int  dimeff = MIN(dim, 3),
         dir,
         i,
         ld     = least_dimension(node % 8),
         nodeff = (node % 8),
         root   = myhost();

    long cubesize = pow2(dim),
         tlen;


    tlen = (long) (MAX_EVENTS * sizeof(ticks));

#ifdef TRANSPUTER

    submit(node, dim, (char *) t, tlen, cubesize);
    if (dimeff == ld) return;
    if ((nodeff == 2) || (nodeff == 3)) {
        if (dimeff > 2) {

            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        return;
    }

    if (nodeff == 1) {

        if (dimeff > 1) {

            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        return;
    }

    if (nodeff == 0) {

        if (dimeff > 0) {

            /* retrans from 1 or 9                                      */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 1) {

            /* retrans from 2 or 10                                     */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);

            /* retrans from 3 or 11                                     */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }

        if (dimeff > 2) {

            /* retrans from 4 or 12                                     */
            directional_receive(node, dim, 0x4, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);

            /* retrans from 5 or 13                                     */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);

            /* retrans from 6 or 14                                     */
            directional_receive(node, dim, 0x2, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);

            /* retrans from 7 or 15                                     */
            directional_receive(node, dim, 0x1, (char *) t, tlen);
            submit(node, dim, (char *) t, tlen, cubesize);
        }
    }

#else /* iPSC/2 */

    delay(1.0 + 2.0 * (float) node);

    send(root, (char *) t, tlen, (node + NODE_OFFSET));
#endif /* TRANSPUTER */

}
/* End submit_timing_data() */



/* =========== FUNCTION DEFINITION ===========
 *
 * This function performs the required operations on the Gauss Transform
 * area, G, of A and searches for the next pivot.
 */

#ifdef PROTOTYPE

void update_G(Double_Matrix_Type *A, double *cbuf,
              int cubesize, int k, int me, int n, Pivot_Type *pivot)

#else

void update_G(A, cbuf, cubesize, k, me, n, pivot)

    Double_Matrix_Type *A;
    double             *cbuf;
    int                 cubesize,
                        k,
                        me,
                        n;
    Pivot_Type         *pivot;

#endif
{
    int i,
        j,
        gc = 0,          /* global column number           */
        lc = 0;          /* local column number to start   */

    ticks start;


    while ((gc = global_column(lc, me, cubesize)) <= k) lc++;

    /* The pivot row is k and we know that lc is the first local column to
     * the right of k.  Now we must move through the Gauss Transform area,
     * all A(i,j) where i > k and j > k, and perform the operation:
     *
     *   A(i,j) = A(i,j) - A(i,k) * A(k,j)  <==>  A(i,j) -= cbuf[i]*A(k,j)
     */

    start = clock();

    for (i = k+1; i < A->rows; i++) {

        for (j = lc; j < A->cols; j++) {

            A->matrix[i][j] -= (cbuf[i] * A->matrix[k][j]);

        } /* end for(j) */

    } /* end for(i) */

    t[LOOPTIME] += (clock() - start);

}
/* End update_G() */
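
The update in update_G() is the standard rank-one elimination step restricted to the columns a node owns. As a sketch under simplifying assumptions (a single dense n-by-n row-major array instead of a distributed Double_Matrix_Type, and a hypothetical function name eliminate_stage), one stage looks like this:

```c
/* One elimination stage on a dense n-by-n row-major array a[].
 * mult[i] holds the multiplier A(i,k)/A(k,k) for i > k, which is the
 * role played by cbuf[] in update_G().  Only the trailing block
 * (i > k, j > k) is modified. */
void eliminate_stage(double *a, int n, int k, const double *mult)
{
    int i, j;

    for (i = k + 1; i < n; i++) {
        for (j = k + 1; j < n; j++) {
            a[i * n + j] -= mult[i] * a[k * n + j];
        }
    }
}
```

In the distributed version, each node runs the same inner loop but starts j at its first local column whose global index exceeds k, which is what the while loop over global_column() determines.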




/* ===================================================================== */

main(){

    double *cbuf;             /* column buffer holds one col of A     */

    Double_Matrix_Type *A;    /* this node's portion of the matrix A  */

    int  cubesize,            /* number of processors in the cube     */
         dim,                 /* dimension of the hypercube           */
         gc,                  /* global column number                 */
         i,                   /* generic integer and row ctr          */
         j,                   /* generic integer and col ctr          */
         k,                   /* index to pivot                       */
         m,                   /* number of rows in A (same local/all) */
         me,                  /* id of this processor                 */
         my_cols = 0,         /* number of cols in local portion of A */
         n,                   /* number of cols in all of A           */
         root,                /* host/root processor id               */
         timing;              /* Boolean                              */

    long sizeof_col,          /* sizes, in bytes                      */
         sizeof_int,
         sizeof_pivot;

    ticks start,
          start1;             /* another start                        */

    Pivot_Type pivot;


    /* ========== INITIALIZATIONS ========== */

    for (i = 0; i < MAX_EVENTS; i++) t[i] = 0;

    start = t[START_TIME] = clock();


#ifdef TRANSPUTER

    cubesize = CUBESIZE;
    dim      = DIMENSION;
    initialize_hypercube(dim);

#else

    cubesize = (int) numnodes();
    dim      = (int) nodedim();

#endif

586 



587 

588 

589 

590 

591 

592 

593 

594 

595 

596 

597 

598 

599 

600 



t [DATA.SOURCE] = me = (int) mynode(); 
root = (int) myhostQ; 

sizeof_int = (long) sizeof (int); 

sizeof _pivot = (long) sizeof (Pivot_Type) ; 



    /* BROADCAST THE SIZE(A)
     *
     * All node processors need to know the number of rows and columns in
     * the matrix A [i.e., size(A)]. A broadcast to the entire cube,
     * cubecast(), is used to achieve this. The nodes also need to know
     * whether or not to set timing on, so this value is passed too.
     */

#ifdef TRANSPUTER

    cubecast(me, dim, (char *) &m, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &n, sizeof_int, cubesize);
    cubecast(me, dim, (char *) &timing, sizeof_int, cubesize);

#else /* iPSC/2 */

    cubecast(me, dim, (char *) &m, sizeof_int, ROW_SIZE_TYPE);
    cubecast(me, dim, (char *) &n, sizeof_int, COL_SIZE_TYPE);
    cubecast(me, dim, (char *) &timing, sizeof_int, ARG_TYPE);

#endif /* TRANSPUTER */

    sizeof_col = (long) (m * sizeof(double));



    /* COLUMN BUFFER AND COUNTER
     *
     * The column buffer, cbuf[], will be used to hold one column of A at
     * a time. We will see cbuf[] used on a variety of occasions when we
     * must work with a column of A. Allocate cbuf[] and determine the
     * number of columns that will be stored locally (my_cols).
     *
     */

    cbuf = (double *) malloc(sizeof_col);

    for (i = 0; i < n; i++) { if ((i % cubesize) == me) my_cols++; }

    /* ESTABLISH LOCAL A
     *
     * Allocate storage space for this node's part of A (it is called A
     * even though it is only part of A).
     */

    A = matalloc(m, my_cols);
    t[SETUP] = clock() - start;
    start = clock();

    receive_columns(dim, me, A, n, cbuf, my_cols, sizeof_col);
    t[DISTRIB_COLS] = clock() - start;



    /* BEGIN ITERATION
     *
     * 1.) At the top of the for() loop we have just completed update_G(),
     *     so the local candidate for the next pivot is situated in np[0].
     *     The function elect_next_pivot() performs a series of
     *     directional_exchange()s so that all local candidates compete in
     *     an election process. The winner is np[0].
     *
     * 2.) If all went well, np[0] contains the next pivot. This informa-
     *
     * 3.) If this node has the pivot column [if (p[k] == gc)], it must
     *     divide everything under the pivot by the value of the pivot and
     *     distribute the column to all other nodes (node zero sends to host).
     *
     * 4.) Finally, this node must perform the computations across the
     *     Gauss Transform area for the local portion of A. The
     *     update_G() function also locates the next pivot without special
     *     expense. Then it is time to go back to the top of the loop.
     */

    start = clock();

    for (k = 0; k < (MIN(m,n)); k++) {

        pivot.id = k % cubesize;
        pivot.t = k;

        /* know id; k ==> t; need s, u */

        if (pivot.id == me) locate_pivot(me, cubesize, A, k, &pivot);
        cubecast_from(pivot.id, me, dim, (char *) &pivot, sizeof_pivot);

        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            send(root, (char *) &pivot, sizeof_pivot, cubesize);

#else /* iPSC/2 */

            send(root, (char *) &pivot, sizeof_pivot, PIVOT_TYPE);

#endif /* TRANSPUTER */

            t[PIVOTS_TO_HOST] += (clock() - starti);
        }

        swap_rows(A, k, pivot.s);

        starti = clock();
        if (pivot.id == me) {
            do_pivot_column_arithmetic(A, cbuf, k, me, cubesize);
        }
        t[PCOL_ARITHMETIC] += (clock() - starti);

        starti = clock();
        cubecast_from(pivot.id, me, dim, (char *) cbuf, sizeof_col);
        t[PCOL_DISTRIB] += (clock() - starti);

        if (me == 0) {

            starti = clock();

#ifdef TRANSPUTER

            submit(me, dim, (char *) cbuf, sizeof_col, cubesize);

#else /* iPSC/2 */

            submit(me, dim, (char *) cbuf, sizeof_col, PCOL_TYPE);

#endif /* TRANSPUTER */

            t[PCOLS_TO_HOST] += (clock() - starti);
        }

        starti = clock();
        update_G(A, cbuf, cubesize, k, me, n, &pivot);
        t[UPDATING_G] += (clock() - starti);

    } /* END ITERATION [for(k...)] */

    t[ITERATION] = clock() - start;

    free(cbuf);

    t[STOP] = clock();

    if (timing) submit_timing_data(me, dim);

    return(SUCCESS);
}
/* ============ EOF gfppnode.c ============ */
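
The node program above distributes columns by wrapping: global column j lives on node j % cubesize, so the pivot column for step k is owned by node k % cubesize. The sketch below restates that mapping in isolation. All three function names are hypothetical and do not appear in the thesis code, and local_index() rests on the assumption that each node stores its columns in increasing global order.

```c
/*
 * Hypothetical helpers (not from the thesis code) describing the
 * column-wrap mapping used by the node program.
 */

/* Number of global columns 0..n-1 that land on node `me`;
 * same counting loop as the one that sets my_cols in main(). */
int count_local_cols(int n, int cubesize, int me)
{
    int j, count = 0;

    for (j = 0; j < n; j++) {
        if ((j % cubesize) == me) count++;
    }
    return count;
}

/* Node that owns global column k (the value stored in pivot.id). */
int owner_of_column(int k, int cubesize)
{
    return k % cubesize;
}

/* ASSUMPTION: position of global column k within its owner's local
 * storage, if local columns are kept in increasing global order. */
int local_index(int k, int cubesize)
{
    return k / cubesize;
}
```

With n = 10 columns on a 4-node cube, nodes 0 and 1 hold three columns each and nodes 2 and 3 hold two each; global column 6 belongs to node 2 and, under the ordering assumption, is that node's local column 1. Wrapping keeps the remaining work spread across the nodes as elimination proceeds from left to right.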
gfpcnode.c

/* ======= PROGRAM INFORMATION ==========
 *
 * SOURCE     gfpcnode.c
 * VERSION    2.3
 * DATE       17 September 1991
 * AUTHOR     Jonathan E. Hartman, U. S. Naval Postgraduate School
 * REMARKS    See gf.h.
 *
 *
 */



#include <math.h>

#ifdef TRANSPUTER

#include <conc.h>

#include <matrix.h>
#include <macros.h>
#include <allocate.h>
#include <comm.h>
#include <generate.h>
#include <mathx.h>
#include <ops.h>
#include <timing.h>

#else

#include "/usr/hartman/matlib/matrix.h"
#include "/usr/hartman/matlib/macros.h"
#include "/usr/hartman/matlib/allocate.h"
#include "/usr/hartman/matlib/comm.h"
#include "/usr/hartman/matlib/generate.h"
#include "/usr/hartman/matlib/mathx.h"
#include "/usr/hartman/matlib/ops.h"
#include "/usr/hartman/matlib/timing.h"

#endif

#include "gf.h"

#ifdef TRANSPUTER

Channel *ic[(CUBESIZE + 1)],
        *oc[(CUBESIZE + 1)];

#endif

ticks t[MAX_EVENTS];



/* ============ FUNCTION DEFINITION ============
 *
 * After this node finds its candidate for next pivot, there must be a
 * comparison with all other nodes. The local candidate starts in np[0].
 * Direction-by-direction, candidates are exchanged and the winner is
 * positioned in np[0]. If there is a tie, the candidate from the smaller
 * node number wins. A RANK_DEFICIENT opponent is ignored (the local
 * candidate must be at least as good). In the end, all processors have
 * identical entries in np[0].
 */

#ifdef PROTOTYPE

void elect_next_pivot(int me, int dim, Pivot_Type *np)

#else

void elect_next_pivot(me, dim, np)

int me,
    dim;
Pivot_Type *np;

#endif
{
    int dir;

    long cubesize = pow2(dim),
         len = sizeof(Pivot_Type);

    for (dir = 1; dir < (int) cubesize; dir <<= 1) {

        if (dir != 8) {
            directional_exchange(me, dim, dir, (char *) &(np[1]),
                                 (char *) &(np[0]), len);
        }
        else {
            if ((me % 8) != 0) {      /* we don't want 0 <--> 8 comm */
                directional_exchange(me, dim, dir, (char *) &(np[1]),
                                     (char *) &(np[0]), len);
            }
        }

        if (np[1].id != RANK_DEFICIENT) {
            if (fabs(np[1].u) > fabs(np[0].u)) {
                np[0].id = np[1].id;  np[0].u = np[1].u;
                np[0].s = np[1].s;    np[0].t = np[1].t;
            }
            else {
                if (fabs(np[1].u) == fabs(np[0].u)) {
                    if (np[1].id < np[0].id) {   /* smallest breaks tie */
                        np[0].id = np[1].id;  np[0].u = np[1].u;
                        np[0].s = np[1].s;    np[0].t = np[1].t;
                    }
                }
            }
        } /* end if (np[1].id....) */
    } /* end for(dir) */

    /* Since there is no direct connection between nodes 0 and 8, we once
     * again destroy the beauty and generality of the hypercube so that we
     * can be sure that 0 and 8 have the best candidate for pivot.
     */

    if (dim == 4) {
        if ((me % 8) == 0) {          /* Nodes 0000 and 1000 */
            directional_receive(me, dim, 0x1, (char *) np, len);
        }
        if ((me % 8) == 1) {          /* Nodes 0001 and 1001 */
            directional_send(me, dim, 0x1, (char *) np, len);
        }
    }
}
/* End elect_next_pivot() */


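The election in elect_next_pivot() can be followed more easily when every node is simulated inside one process. The sketch below is an illustration under stated assumptions, not the thesis implementation: Candidate is a simplified, hypothetical stand-in for Pivot_Type (only id and u), the array copy stands in for directional_exchange(), abs_val() stands in for fabs(), and both the RANK_DEFICIENT test and the special handling of the missing 0 <--> 8 link are left out.

```c
#define MAX_NODES 16   /* assumed upper bound: a 4-dimensional cube */

/* Hypothetical, simplified stand-in for Pivot_Type. */
typedef struct {
    int    id;   /* node that proposed the candidate */
    double u;    /* value of the candidate pivot     */
} Candidate;

static double abs_val(double x) { return (x < 0.0) ? -x : x; }

/* Nonzero if candidate a should replace candidate b:
 * larger magnitude wins, smaller node id breaks a tie. */
static int better(const Candidate *a, const Candidate *b)
{
    double fa = abs_val(a->u), fb = abs_val(b->u);

    if (fa > fb) return 1;
    if (fa == fb && a->id < b->id) return 1;
    return 0;
}

/*
 * cand[p] is node p's local candidate (the role of np[0]).  nnodes is
 * assumed to be a power of two.  Returns 0 on success, -1 if nnodes
 * is out of range.
 */
int elect(Candidate *cand, int nnodes)
{
    Candidate recv[MAX_NODES];   /* the role of np[1] on each node */
    int dir, p;

    if (nnodes < 1 || nnodes > MAX_NODES) return -1;

    for (dir = 1; dir < nnodes; dir <<= 1) {
        /* every node swaps with its neighbor across direction dir */
        for (p = 0; p < nnodes; p++) recv[p] = cand[p ^ dir];

        /* every node keeps the better of its own and the received one */
        for (p = 0; p < nnodes; p++) {
            if (better(&recv[p], &cand[p])) cand[p] = recv[p];
        }
    }
    return 0;
}
```

Starting from candidates 1.0, -3.0, 3.0 and 2.0 on nodes 0 through 3, two exchange rounds leave every node holding node 1's candidate: -3.0 and 3.0 tie in magnitude and the smaller node number wins. After dim rounds all nodes agree without any node having to collect all the candidates.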

/* This is only the first part of this file. The rest would be similar to
 * gfppnode.c
 *
 * ============ EOF gfpcnode.c =========== */